TUTORIALS

For the first time, SLTU will offer one-day interactive tutorials (including hands-on experiments) on ASR-Kaldi and TTS-Festival for under-resourced languages:


BUILDING SPEECH RECOGNITION SYSTEMS WITH THE KALDI TOOLKIT (ASR-KALDI) TUTORIAL
Team:
  • Sanjeev Khudanpur (JHU, USA)
  • Jan Trmal (JHU, USA)
  • Vijayaditya Peddinti (JHU, USA)
  • Sarah Samson Juan (Unimas, Malaysia)
  • Dessi Puji Lestari (ITB, Indonesia)

Overview:
This tutorial will show participants how to build state-of-the-art automatic speech recognition (ASR) systems using the Kaldi open-source tools, with particular emphasis on ASR systems for low-resource languages. Participants will be provided a basic language pack for Iban (a Malayic language), including transcribed speech, a pronunciation lexicon and additional text for language modeling. They will be shown how to preprocess and organize this training data into the format required by Kaldi. They will then be shown how to build a sequence of increasingly complex Gaussian mixture model (GMM) based acoustic-phonetic models. These models will then be used to bootstrap the training of deep neural network (DNN) based acoustic models via cross-entropy training. Concurrently, they will learn how to build and optimize n-gram language models. Finally, they will be shown how to prepare the weighted finite state transducer (WFST) representations of the language model (G), lexicon (L), context-dependency (C) and acoustic models (H), to compose them into a decoding graph, and to determinize and minimize it to obtain a static decoding graph (HCLG). Participants will evaluate the efficacy of their ASR systems by decoding a held-out (test) data set, and will use standard scoring tools to generate performance statistics such as word error rate.
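As a preview of the data-preparation step, the following is a minimal sketch of the directory layout Kaldi expects and of the first few recipe commands. It assumes a standard Kaldi recipe checkout (with utils/ and steps/ available) and uses hypothetical paths for the Iban language pack; the tutorial's own scripts may differ in detail.

# Minimal sketch (not the exact tutorial recipe) of preparing a Kaldi data
# directory and starting GMM training; paths for the Iban pack are hypothetical.
#
# data/train/wav.scp  : <utterance-id> <path to wav file>
# data/train/text     : <utterance-id> <word transcription>
# data/train/utt2spk  : <utterance-id> <speaker-id>
# data/local/dict     : lexicon.txt, silence_phones.txt, nonsilence_phones.txt, ...

utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
utils/validate_data_dir.sh --no-feats data/train   # check sorting and consistency

# Build the lang directory from the lexicon, then extract MFCC features:
utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang data/lang
steps/make_mfcc.sh --nj 4 data/train exp/make_mfcc/train mfcc
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc

# First (monophone) GMM system, later used to bootstrap triphone and DNN models:
steps/train_mono.sh --nj 4 data/train data/lang exp/mono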

The first (morning) lecture session will go over the general ASR concepts and the Kaldi-specific information needed to perform the sequence of steps described above. Participants will then use the laboratory sessions to carry out these steps, building and evaluating their own individual ASR systems, beginning before lunch (1 hour) and continuing after lunch (3 hours). In the second (afternoon) lecture session, the group will reconvene to analyze the participants' experience and to answer questions of general interest that arose while building the ASR systems; active participation in this discussion is expected. Finally, the lecture will conclude by touching on advanced topics not covered in the basic system-building tutorial, including speech segmentation (VAD), lexicon augmentation (G2P), sequence (MMI) training of DNNs, and other topics as time permits.


Biography:

Sanjeev Khudanpur received a B.Tech. in Electrical Engineering from the Indian Institute of Technology, Bombay, in 1988, and a Ph.D. in Electrical Engineering from the University of Maryland, College Park, in 1997. His doctoral dissertation, supervised by Prof. Prakash Narayan, was titled "Model Selection and Universal Data Compression." Since 1996, he has been on the faculty of the Johns Hopkins University. Until June 2001, he was an Associate Research Scientist in the Center for Language and Speech Processing and, from July 2001 to June 2008, an Assistant Professor in the Department of Electrical and Computer Engineering and the Department of Computer Science; he became an Associate Professor in July 2008.

He is a founding member of the Johns Hopkins University Human Language Technology Center of Excellence, a member of the Center for Language and Speech Processing, and a member of the steering committee of the Johns Hopkins University Science of Learning Institute. He is interested in the application of information theoretic methods to human language technologies such as automatic speech recognition, machine translation and natural language processing. All these technologies make heavy use of statistical models of human language. He is interested in understanding the structure of such models and in estimating their parameters from data. He organizes the annual Johns Hopkins Summer Workshops to advance the greater research agenda of this field.

Website: http://www.clsp.jhu.edu/~sanjeev/

Schedule:
09:00-10:30 Tutorial Part I
  • Kaldi introduction.
  • Kaldi experiments start.
10:30-11:00 Coffee Break
11:00-12:00 Tutorial Part II
  • Kaldi experiments.
12:00-13:30 Lunch
13:30-15:30 Tutorial Part III
  • Kaldi experiments.
15:30-16:00 Coffee Break
16:00-17:00 Tutorial Part IV
  • Experiments discussion.
17:00-18:30 Tutorial Part V
  • Advanced issues.

Target participants and prerequisites:

  • This tutorial is intended for participants who want to engage in speech research, such as beginning students in Master's and PhD programs, who wish to develop strong baseline ASR systems for evaluating new ideas. Application developers in industry will find the tutorial useful for gaining insight into the capabilities of the Kaldi tools, but technical considerations such as using cloud computing to scale up to large data sets, on-line decoding, etc. will not be addressed.

  • Participants should be proficient in computer programming, specifically in bash and Perl scripting and in Unix command-line utilities. Knowledge of C/C++ and of advanced Unix utilities will be a plus. Some basic understanding of automatic speech recognition (e.g. hidden Markov models and artificial neural networks) is also expected.





SPEECH SYNTHESIS AND LOW RESOURCE LANGUAGES (TTS-FESTIVAL) TUTORIAL
Team:
  • Richard Sproat (Google)
  • Rob Clark (Google)
  • Alexander Gutkin (Google)
  • Martin Jansche (Google)

Overview:
We propose a tutorial on approaches to speech synthesis for low-resourced languages. We will focus on techniques that do not depend on access to large databases of professionally recorded speech, but instead work with small databases of good (but not professional) quality speech from multiple speakers. On the text normalization side, we focus on how to build grammars that can be ported across languages with only minor adaptation: for example, adapting a Hindi text normalizer to work on Bangla. The tutorial will consist of two parts. The morning sessions will be an introduction to the theory and tools. The tools and resources will include:

  • The open source speech synthesis framework, Festival.
  • The ChitChat recording environment, which allows users to record speech samples easily.
  • Sparrowhawk, an open source version of the Google text normalization system (Kestrel), which interfaces to Festival.
  • The Thrax open source finite state grammar development toolkit.
  • A pronunciation modeling toolkit based on OpenFst and OpenGrm.

After lunch, there will be a hands-on session designed to give participants some experience with the tools. We will build parts of a working end-to-end synthesis system, starting with text normalization and the lexicon, and ending with constructing a voice. We will use data from a low-resourced language, Afrikaans, for the examples and for the lab session. Participants are not assumed to know the language in question: indeed, part of the experience to be gained from this tutorial is working with an unfamiliar language, which is the typical situation for engineers and linguists who develop speech synthesis systems. More details can be found at https://sites.google.com/site/sltututorial/overview.
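To make the grammar-porting idea concrete, below is a toy sketch of a Thrax grammar in which the language-specific part is a small digit-to-word table and the surrounding rule structure stays fixed across languages. The file name, rule names and word list are illustrative assumptions, not material from the tutorial; the compile and test commands assume the standard Thrax command-line tools are installed.

# Toy illustration (not tutorial material): the digit-name table is the
# language-specific part; porting to a new language mostly means replacing it.
cat > verbalize.grm << 'EOF'
digit_names =
    ("0" : "zero") | ("1" : "one") | ("2" : "two") | ("3" : "three") |
    ("4" : "four") | ("5" : "five") | ("6" : "six") | ("7" : "seven") |
    ("8" : "eight") | ("9" : "nine");

# Language-independent part: read a digit string out digit by digit,
# inserting spaces between the words.
export VERBALIZE_DIGITS = Optimize[
    digit_names (("" : " ") digit_names)*
];
EOF

# Compile the grammar into a FAR archive and try it interactively.
thraxcompiler --input_grammar=verbalize.grm --output_far=verbalize.far
thraxrewrite-tester --far=verbalize.far --rules=VERBALIZE_DIGITS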


Biography:

Richard Sproat received his Ph.D. in Linguistics from the Massachusetts Institute of Technology in 1985. He worked at AT&T Bell Labs, at Lucent's Bell Labs and at AT&T Labs - Research before joining the faculty of the University of Illinois. From there he moved to the Center for Spoken Language Understanding at the Oregon Health & Science University. In the Fall of 2012 he moved to Google, New York as a Research Scientist. Sproat has worked in numerous areas relating to language and computational linguistics, including syntax, morphology, computational morphology, articulatory and acoustic phonetics, text processing, text-to-speech synthesis, and text-to-scene conversion. Some of his recent work includes multilingual named entity transliteration and the effects of script layout on child language. At Google he works on multilingual text normalization and finite-state methods for language processing. He also has a long-standing interest in writing systems and symbol systems more generally.


Schedule:
09:00-10:30 Tutorial Part I
  • Introduction to TTS (30 minutes), with an introduction to Festival.
  • Introduction to lexicons, text normalization and tools (60 minutes), with an introduction to Sparrowhawk and Thrax.
10:30-11:00 Coffee Break
11:00-12:00 Tutorial Part II
  • Introduction to parametric speech synthesis and tools (60 minutes).
12:00-13:30 Lunch
13:30-15:30 Tutorial Part III
  • Overview of lab session (15 minutes).
  • Lab session to build components of voice and text normalization for a surprise language.

Target participants and prerequisites:

  • The target participants are students and professionals with some background in speech technology, e.g. people who have taken classes in speech recognition, or programmers who have worked on speech technology. Some background in parametric (HMM-based) synthesis is helpful, but not essential.

  • For the text normalization portion, participants should understand concepts such as "regular language" and "regular expression", should know what a "finite-state acceptor" and a "finite-state transducer" are, and ideally will have had at least some exposure to the linguistic analysis components of speech systems.

System requirements:

We assume that participants will be working on a Linux or Linux-like system, as most of the tools we will use are designed for Linux-like environments.

Downloading materials:

The only technical prerequisite for the tutorial is that participants should be able to run Docker and access the image at https://hub.docker.com/r/mjansche/tts-tutorial-sltu2016/.
The use of Festival and the other tools will be covered during the tutorial.
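As a rough sketch of what that involves, the commands below pull the image and start an interactive shell inside it; the mounted working directory and the bash entrypoint are illustrative assumptions, not instructions from the organizers.

# Pull the tutorial image and open an interactive shell inside it.
# Mounting the current directory at /work is only a convenience assumption.
docker pull mjansche/tts-tutorial-sltu2016
docker run -it -v "$PWD":/work -w /work mjansche/tts-tutorial-sltu2016 /bin/bash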
Further questions can be directed to Martin Jansche (mjansche@google.com).