Interspeech 2007 Session TuD.O3: Accent and language identification II
Tuesday, August 28, 2007
16:00 – 18:00
Tanja Schultz (Carnegie Mellon University)
An open-set detection evaluation methodology applied to language and emotion recognition
David van Leeuwen, TNO Human Factors
Khiet Truong, TNO Human Factors
This paper introduces a detection evaluation methodology for speech recognition technologies for which it is difficult to obtain an abundance of non-target classes. An example is language recognition, where we would like to measure the detection capability for a single target language without confounding it with the modeling capability for non-target languages. The evaluation framework is based on a cross-validation scheme that leaves the non-target class out of the training material allowed for the detector. The framework allows us to use Detection Error Trade-off (DET) curves properly. As a second application example, we apply the evaluation scheme to emotion recognition in order to obtain single-emotion detection performance assessments.
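As a concrete picture of the protocol described in the abstract, the sketch below scores each non-target class with a detector whose training material excludes that class, then pools the scores into a crude equal-error-rate estimate. The toy one-dimensional detector and all names here are illustrative assumptions, not taken from the paper.

```python
def mean(xs):
    return sum(xs) / len(xs)

def train_detector(target, background):
    """Toy 1-D detector: higher score = closer to the target mean than to
    the background mean (an illustrative stand-in for a real detector)."""
    mu_t, mu_b = mean(target), mean(background)
    return lambda x: abs(x - mu_b) - abs(x - mu_t)

def cross_validated_scores(data, target):
    """Leave-one-class-out evaluation: each non-target class is scored by a
    detector trained without that class, as the abstract describes."""
    nontargets = [c for c in data if c != target]
    non_scores = []
    for held_out in nontargets:
        allowed = [x for c in nontargets if c != held_out for x in data[c]]
        det = train_detector(data[target], allowed)
        non_scores += [det(x) for x in data[held_out]]
    # Target trials may use all non-target classes as background material.
    det = train_detector(data[target],
                         [x for c in nontargets for x in data[c]])
    tar_scores = [det(x) for x in data[target]]
    return tar_scores, non_scores

def equal_error_rate(tar, non):
    """Rough EER: smallest max(miss rate, false-alarm rate) over thresholds."""
    best = 1.0
    for t in sorted(set(tar) | set(non)):
        p_miss = sum(s < t for s in tar) / len(tar)
        p_fa = sum(s >= t for s in non) / len(non)
        best = min(best, max(p_miss, p_fa))
    return best
```

With well-separated toy classes, the pooled non-target scores come only from detectors that never saw the held-out class, which is exactly what keeps the DET analysis honest.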
Boosting with Anti-models for Automatic Language Identification
Xi Yang, Hong Kong University of Science and Technology
Man-hung Siu, Hong Kong University of Science and Technology
Herbert Gish, BBN Technologies
Brian Mak, Hong Kong University of Science and Technology
In this paper, we adopt the boosting framework to improve the performance of acoustic-based Gaussian mixture model (GMM) language identification (LID) systems. We introduce a set of low-complexity, boosted target and anti-models that are estimated from training data to improve class separation and are integrated during the LID backend process, resulting in a fast estimation procedure. Experiments were performed on the 12-language NIST 2003 language recognition evaluation classification task using a GMM-acoustic-score-only LID system, as well as a system that combines GMM acoustic scores with sequence language model scores from GMM tokenization. Classification errors were reduced from 18.8% to 10.5% on the acoustic-score-only system, and from 11.3% to 7.8% on the combined acoustic and tokenization system.
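One way to picture the backend integration of boosted target and anti-models is as a weighted sum, over boosting rounds, of target-versus-anti-model log-likelihood differences. This is an illustrative reading of the abstract, not the paper's exact formulation; all names below are ours.

```python
def boosted_score(target_logliks, anti_logliks, round_weights):
    """Combine per-round scores: each boosting round contributes its target
    model's log-likelihood minus its anti-model's, scaled by the round's
    boosting weight (a generic boosted log-likelihood-ratio picture)."""
    assert len(target_logliks) == len(anti_logliks) == len(round_weights)
    return sum(w * (t - a)
               for w, t, a in zip(round_weights, target_logliks, anti_logliks))

def classify(per_language_scores):
    """Pick the language whose combined boosted score is highest."""
    return max(per_language_scores, key=per_language_scores.get)
```

The anti-model acts as a class-specific background: an utterance scores positively only when the target models explain it better than their paired anti-models across the rounds.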
Acoustic Language Identification Using Fast Discriminative Training
Fabio Castaldo, Politecnico di Torino - Italy
Daniele Colibro, Loquendo - Italy
Emanuele Dalmasso, Politecnico di Torino - Italy
Pietro Laface, Politecnico di Torino - Italy
Claudio Vair, Loquendo - Italy
Gaussian Mixture Models (GMMs) in combination with Support Vector Machine (SVM) classifiers have been shown to give excellent classification accuracy in speaker recognition. In this work we use this approach for language identification, and we compare its performance with the standard approach based on GMMs. In the GMM-SVM framework, a GMM is trained for each training or test utterance. Since it is difficult to accurately train a model with short utterances, in these conditions the standard GMMs perform better than the GMM-SVM models. To overcome this limitation, we present an extremely fast GMM discriminative training procedure that exploits the information given by the separation hyperplanes estimated by an SVM classifier. We show that our discriminative GMMs provide considerable improvement compared with the standard GMMs and perform better than the GMM-SVM approach for short utterances, achieving state-of-the-art performance for acoustic-only systems.
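A common realization of "a GMM per utterance" in GMM-SVM systems is relevance-MAP adaptation of a shared background model's means, with the adapted means stacked into a fixed-length supervector that a linear SVM can then separate. The one-dimensional, hard-assignment sketch below is a deliberately simplified assumption of ours, not the authors' implementation.

```python
def adapt_supervector(ubm_means, frames, relevance=16.0):
    """MAP-adapt each background (UBM) component mean toward the utterance's
    frames (hard nearest-component assignment for simplicity), then stack the
    adapted means into one fixed-length supervector."""
    assigned = {i: [] for i in range(len(ubm_means))}
    for x in frames:
        # assign the frame to the nearest component (1-D toy distance)
        i = min(range(len(ubm_means)), key=lambda j: abs(x - ubm_means[j]))
        assigned[i].append(x)
    supervector = []
    for i, m in enumerate(ubm_means):
        n = len(assigned[i])
        mean_i = sum(assigned[i]) / n if n else m
        # relevance-MAP interpolation between utterance data and the prior mean
        supervector.append((n * mean_i + relevance * m) / (n + relevance))
    return supervector
```

Because every utterance, long or short, maps to a vector of the same dimension, the SVM sees a fixed-length representation; the abstract's point is that this vector is poorly estimated for short utterances, which motivates their discriminative GMM training instead.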
Spoken Language Identification Using Score Vector Modeling and Support Vector Machine
Ming Li, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
Hongbin Suo, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
Xiao Wu, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
Ping Lu, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
Yonghong Yan, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
The support vector machine (SVM) framework based on the generalized linear discriminant sequence (GLDS) kernel has been shown to be effective in language identification tasks. In this paper, in order to compensate for the distortions due to inter-speaker variability within the same language and to address the practical memory limitations imposed by training on large databases, multiple speaker-group-based discriminative classifiers are employed to map the cepstral features of speech utterances into discriminative language characterization score vectors (DLCSV). Furthermore, backend SVM classifiers are used to model the distribution of each target language in the DLCSV space, and the output scores of the backend classifiers are calibrated into final language recognition scores by a pair-wise posterior probability estimation algorithm. The proposed SVM framework is evaluated on the 2003 NIST Language Recognition Evaluation database, achieving an equal error rate of 4.0% on the 30-second task.
Language Identification based on n-gram Frequency Ranking
Ricardo Cordoba, Speech Technology Group. Dept. of Electronic Engineering. Universidad Politécnica de Madrid
Luis F. D'Haro, Speech Technology Group. Dept. of Electronic Engineering. Universidad Politécnica de Madrid
Fernando Fernandez-Martinez, Speech Technology Group. Dept. of Electronic Engineering. Universidad Politécnica de Madrid
Javier Macias-Guarasa, Speech Technology Group. Dept. of Electronic Engineering. Universidad Politécnica de Madrid
Javier Ferreiros, Speech Technology Group. Dept. of Electronic Engineering. Universidad Politécnica de Madrid
We present a novel approach for language identification based on a text categorization technique, namely n-gram frequency ranking. We use a parallel phone recognizer, as in PPRLM, but instead of a language model we create a ranking of the most frequent n-grams, keeping only a fraction of them. We then compute the distance between the input sentence's ranking and each language's ranking, based on the difference in relative positions for each n-gram. The objective of this ranking is to reliably model a longer span than PPRLM, namely 5-grams instead of trigrams, because the ranking needs less training data for a reliable estimation. We demonstrate that this approach outperforms PPRLM (6% relative improvement) thanks to the inclusion of 4-grams and 5-grams in the classifier. We present two alternatives: ranking with absolute values for the number of occurrences, and ranking with discriminative values (11% relative improvement).
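The ranking-and-distance scheme described here is the classic "out-of-place" text-categorization measure of Cavnar and Trenkle: rank the most frequent n-grams, then score an utterance by how far each of its n-grams' ranks sit from the language model's ranks. A sketch over phone-token sequences (the parameter values and helper names are illustrative, not the paper's):

```python
from collections import Counter

def ngram_ranking(tokens, n_values=(1, 2, 3, 4, 5), top=300):
    """Rank the n-grams of a phone-token sequence by frequency, keeping only
    the `top` most frequent ones, as the abstract describes."""
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top))}

def out_of_place_distance(utt_ranking, lang_ranking):
    """Sum over the utterance's n-grams of |utterance rank - language rank|;
    n-grams absent from the language ranking take the maximum penalty."""
    penalty = len(lang_ranking)
    return sum(abs(r - lang_ranking.get(g, penalty))
               for g, r in utt_ranking.items())

def identify(utt_ranking, language_rankings):
    """Choose the language whose n-gram ranking is closest to the utterance's."""
    return min(language_rankings,
               key=lambda lang: out_of_place_distance(utt_ranking,
                                                      language_rankings[lang]))
```

Keeping only the top of the ranking is what makes 5-grams feasible: rare high-order n-grams, whose counts would be unreliable, never enter the comparison.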
Improving Phonotactic Language Recognition with Acoustic Adaptation
Wade Shen, MIT/Lincoln Laboratory
Douglas Reynolds, MIT/Lincoln Laboratory
In recent evaluations of automatic language recognition systems, phonotactic approaches have proven highly effective [Cernocky05] [Navratil06]. However, as most of these systems rely on underlying ASR techniques to derive a phonetic tokenization, they are potentially susceptible to acoustic variability from non-language sources (e.g., gender, speaker, or channel). In this paper we apply techniques from ASR research to normalize and adapt HMM-based phonetic models in order to improve phonotactic language recognition performance. Experiments conducted with these techniques show an EER reduction of 29% over traditional PRLM-based approaches.
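Of the ASR normalization techniques the abstract alludes to, per-utterance cepstral mean and variance normalization (CMVN) is the simplest to illustrate; the paper's actual normalization and adaptation recipe is more involved, so the sketch below is only a generic example of feature-level channel compensation.

```python
import math

def cmvn(frames):
    """Per-utterance cepstral mean/variance normalization: for each feature
    dimension, subtract the utterance mean and divide by the standard
    deviation, removing gross channel/speaker offsets before tokenization."""
    n, dims = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        stds.append(math.sqrt(var) if var > 0 else 1.0)  # guard flat dims
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]
```

After this step, every utterance's features have zero mean and unit variance per dimension, so the phone recognizer's tokenization depends less on recording conditions and more on the phonetic content the phonotactic model needs.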