Speaker Recognition Systems: Paradigms and Challenges

Keynote Speaker

Dr. Thomas Fang Zheng, Research Institute of Information Technology (RIIT), Tsinghua University (THU), China

APSIPA Distinguished Lecturer 2012-2013


Speaker recognition applications are becoming more and more popular. However, in practical applications many factors may affect the performance of systems.

In this talk, a general introduction to speaker recognition will be presented, including its definition, applications, categories, and key issues in terms of research and application. Robust speaker recognition technologies that are useful in practical applications will be outlined, covering robustness to cross-channel conditions, multiple speakers, background noise, emotion, short utterances, and time variation (or aging). Recent research on time-varying robust speaker recognition will be described in detail.

Performance degradation over time is a generally acknowledged phenomenon in speaker recognition, and it is widely assumed that speaker models should be updated from time to time to remain representative. However, such updating is costly, user-unfriendly, and sometimes unrealistic, which hinders the technology in practical applications. From a pattern recognition point of view, the time-varying issue in speaker recognition calls for features that are speaker-specific and as stable as possible across sessions recorded at different times. Therefore, after searching for and analyzing the most stable parts of the feature space, a Discrimination-emphasized Mel-frequency-warping method is proposed. In the implementation, each frequency band is assigned a discrimination score, which takes into account both speaker and session information, and Mel-frequency warping is applied during feature extraction to emphasize bands with higher scores. Experimental results show that, on the time-varying voiceprint database, this method not only improves speaker recognition performance, with an EER reduction of 19.1%, but also alleviates the performance degradation brought by time variation, with a reduction of 8.9%.
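The idea of scoring frequency bands and then warping the frequency axis toward the higher-scoring ones can be illustrated with a minimal sketch. It is not the implementation presented in the talk: the abstract does not give the score formula, so an F-ratio-style score (variance between speakers divided by variance across sessions) is assumed here, and the function names, the input layout, and the filter-placement rule are all hypothetical.

```python
import numpy as np


def hz_to_mel(f_hz):
    """Standard Mel-scale mapping from frequency in Hz."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def discrimination_scores(band_energy):
    """Assumed F-ratio-style score per frequency band.

    band_energy has shape (n_speakers, n_sessions, n_bands) and holds one
    average log energy per speaker, session, and band (assumed layout).
    A band scores high when speakers differ a lot in it while each
    speaker stays stable across sessions.
    """
    speaker_means = band_energy.mean(axis=1)                # (speakers, bands)
    between_speaker = speaker_means.var(axis=0)             # (bands,)
    across_session = band_energy.var(axis=1).mean(axis=0)   # (bands,)
    return between_speaker / (across_session + 1e-12)


def warped_center_frequencies(band_edges_hz, scores, n_filters):
    """Place filter centres so that higher-scoring bands get more filters.

    Each band's share of the warped axis is its Mel width multiplied by
    its normalized score; centres are spaced uniformly on that axis and
    mapped back to Hz by linear interpolation.
    """
    band_edges_hz = np.asarray(band_edges_hz, dtype=float)
    scores = np.maximum(np.asarray(scores, dtype=float), 1e-6)  # keep axis increasing
    mel_width = np.diff(hz_to_mel(band_edges_hz))
    weight = mel_width * scores / scores.mean()
    cumulative = np.concatenate([[0.0], np.cumsum(weight)])
    targets = np.linspace(0.0, cumulative[-1], n_filters + 2)[1:-1]
    return np.interp(targets, cumulative, band_edges_hz)
```

With uniform scores the placement reduces to ordinary Mel spacing within the piecewise-linear approximation, so the sketch can be read as a reweighting of the standard Mel filterbank layout.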


Dr. Thomas Fang Zheng is a full research professor and Vice Dean of the Research Institute of Information Technology (RIIT), Tsinghua University (THU), and Director of the Center for Speech and Language Technologies (CSLT), RIIT, THU.

Since 1988, he has been working on speech and language processing. He has been in charge of, or a key participant in, the R&D of more than 30 national key projects and international cooperation projects, and has received more than 10 awards from the State Ministry (Commission) of Education, the State Ministry (Commission) of Science and Technology, Beijing City, and others. So far, he has published over 200 journal and conference papers, 11 of which (3 as first author) were named Excellent Papers, and 11 books (refer to for details). He has served many conferences, journals, and organizations.

He is an IEEE Senior Member, a CCF (China Computer Federation) Senior Member, an Oriental COCOSDA (Committee for the international Coordination and Standardization of speech Databases and input/output Assessment methods) key member, an ISCA member, an APSIPA (Asia-Pacific Signal and Information Processing Association) member, a council member of the Chinese Information Processing Society of China, a council member of the Acoustical Society of China, and a member of the Phonetic Association of China, among others.

He serves as Council Chair of the Chinese Corpus Consortium (CCC), a Steering Committee member and a BoG (Board of Governors) member of APSIPA, Chair of the Steering Committee of the National Conference on Man-Machine Speech Communication (NCMMSC) of China, head of the Voiceprint Recognition (VPR) special topic group of the Chinese Speech Interactive Technology Standard Group, Vice Director of Subcommittee 2 on Human Biometrics Application of Technical Committee 100 on Security Protection Alarm Systems of the Standardization Administration of China (SAC/TC100/SC2), and a member of the Artificial Intelligence and Pattern Recognition Committee of CCF.

He is an associate editor of IEEE Transactions on Audio, Speech, and Language Processing, a member of the editorial board of Speech Communication, a member of the editorial board of APSIPA Transactions on Signal and Information Processing, an associate editor of the International Journal of Asian Language Processing, and a member of the editorial committee of the Journal of Chinese Information Processing.

He has previously served as co-chair of Program Committee of International Symposium on Chinese Spoken Language Processing (ISCSLP) 2000, member of Technical Committee of ISCSLP 2000, member of Organization Committee of Oriental COCOSDA 2000, member of Program Committee of NCMMSC 2001, member of Scientific Committee of ISCA Tutorial and Research Workshop on Pronunciation Modeling and Lexicon Adaptation for Spoken Language Technology 2002, member of Organization Committee and international advisor of Joint International Conference of SNLP-O-COCOSDA 2002, General Chair of Oriental COCOSDA 2003, member of Scientific Committee of International Symposium on Tonal Aspects of Languages (TAL) 2004, member of Scientific Committee and Session Chair of ISCSLP 2004, chair of Special Session on Speaker Recognition in ISCSLP 2006, Program Committee Chair of NCMMSC 2007, Program Committee Chair of NCMMSC 2009, Tutorial Co-Chair of APSIPA ASC 2009, Program Committee Chair of NCMMSC 2011, general co-chair of APSIPA ASC 2011, and APSIPA Distinguished Lecturer (2012-2013).

He has also been working on building a "Study-Research-Product" channel, devoting himself to transferring speech and language technologies to industry, including language learning, embedded speech recognition, speaker recognition for public security and telephone banking, location-centered intelligent information retrieval services, and so on. He now holds over 10 patents in various aspects of speech and language technologies.

He has supervised dozens of doctoral and master's students, several of whom have received awards, and he was accordingly named Excellent Graduate Supervisor. His honors include the 1997 Beijing City Patriotic and Contributing Model Certificate, the 1999 National College Young Teacher (Teaching) Award issued by the Fok Ying Tung Education Foundation of the Ministry of Education (MOE), the 2000 1st Prize of the Beijing City College Teaching Achievement Award, the 2001 2nd Prize of the Beijing City Scientific and Technical Progress Award, the 2007 3rd Prize of the Science and Technology Award of the Ministry of Public Security, and the 2009 China "Industry-University-Research Institute" Collaboration Innovation Award.


A multi-disciplinary approach for processing under-resourced languages

Keynote Speaker

Professor Laurent Besacier, LIG - University Joseph Fourier, France


The term "under-resourced language" refers to a language with some (if not all) of the following characteristics: lack of a unique writing system or stable orthography, limited presence on the web, lack of linguistic expertise, and lack of electronic resources for NLP (natural language processing) such as monolingual corpora, bilingual electronic dictionaries, transcribed speech data, etc. Porting an NLP system (e.g., a speech recognition system) to such a language requires techniques that go far beyond the basic re-training of the models. Indeed, processing a new language often leads to new challenges (special phonological systems, word segmentation problems, unwritten languages, etc.). The lack of resources, in turn, requires innovative data collection methodologies (via crowdsourcing, for instance) or models in which information is shared between languages (e.g., multilingual acoustic models). In addition, some social and cultural aspects related to the context of the targeted language bring additional problems: languages with many dialects in different regions, code-switching or code-mixing phenomena, and a massive presence of non-native speakers (in vehicular languages such as Swahili). Finally, contributing to the development of technologies for these under-resourced languages contributes to their revitalization or (at least) their documentation, which can be considered extremely important ("We should treat language diversity as we treat bio-diversity", David Crystal, Language Death, Cambridge: CUP, 2000).

Thus, automatic processing of under-resourced languages is a way to study language diversity with a multi-disciplinary view. When addressing automatic processing of under-resourced languages, an important problem is the gap between language experts (the speakers themselves) and technology experts (system developers). Indeed, it is almost impossible to find native speakers with the necessary technical skills to develop their own systems. Moreover, under-resourced languages are often poorly addressed in the linguistics literature, and very few studies describe them. To bootstrap systems for such languages, one has to borrow resources and knowledge from similar languages, which requires the help of dialectologists (to find proximity indices between languages) and phoneticians (to map the phonetic inventories between the targeted under-resourced language and better-resourced ones), among others. For some languages, it is also sometimes interesting to challenge the paradigms and common practices: is the word the best unit for language modelling? Is the phoneme the best unit for acoustic modelling? Furthermore, for some (rare, endangered) languages, it is often necessary to work with ethnolinguists in order to gain access to native speakers and to collect data in accordance with basic technical and ethical rules. Finally, in the case of endangered languages, a natural application of natural language processing technologies is the development of computer-assisted language learning systems, which contribute to the revitalization of the targeted language. This often requires working with language teachers in order to collect realistic data and to assess the systems developed.

In this talk, I will present some of my contributions on this topic for languages from four continents (Paes in Colombia, Vietnamese and Khmer in Asia, Amharic and Swahili in Africa and under-resourced languages from Eastern Europe).


Laurent Besacier obtained an engineering degree in electronics and information processing in 1995 and a Ph.D. in computer science in 1998. His Ph.D. thesis was dedicated to speaker recognition (voice biometrics). During his post-doc in a signal processing laboratory, he contributed, via a European project, to multimodal biometrics, and he also started working on automatic speech recognition, which became his main research topic after 1999 (when he obtained an associate professor position in Computer Science at the University of Grenoble-1). He has contributed strongly to the development of speech technologies for under-resourced languages, a topic which is now highly visible in the speech and language community. After a sabbatical year spent at the IBM Watson Research Center (2005-06, on a speech translation project for dialectal Arabic spoken in Iraq), Laurent Besacier initiated research in machine translation. These research directions (machine translation, under-resourced languages) led him to approach researchers from the humanities (linguists, phoneticians, translators). L. Besacier has been a full professor at University Grenoble 1 (Joseph Fourier) since September 2009.