Interspeech 2007 Session TuC.P2a: Accent and language identification I
Type: poster
Date: Tuesday, August 28, 2007
Time: 13:30 – 15:30
Room: Alpaerts
Chair: Mari Ostendorf (University of Washington)
TuC.P2a‑1
Discriminative Optimization of language adapted HMMs for a Language Identification System based on Parallel Phoneme Recognizers
Josef G. Bauer, Siemens AG
Bernt Andrassy, Siemens AG
Ekaterina Timoshenko, Siemens AG
Recently, an unsupervised learning scheme was proposed for the Hidden Markov Models (HMMs) used in acoustical Language Identification (LID) based on Parallel Phoneme Recognizers (PPR). This avoids the high cost of orthographically transcribed speech data and phonetic lexica, but was found to introduce a considerable increase in classification errors. Also very recently, discriminative Minimum Language Identification Error (MLIDE) optimization of HMMs for PPR-based LID was introduced, which again requires only language-tagged speech data and an initial HMM. The described work shows how to combine both approaches into an unsupervised and discriminative learning scheme. Experimental results on large telephone speech databases show that using MLIDE the relative increase in error rate introduced by unsupervised learning can be reduced from 61% to 26%. The absolute difference in LID error rate due to the supervised learning step is reduced from 4.1% to 0.8%.
TuC.P2a‑2
Fusion of Contrastive Acoustic Models for Parallel Phonotactic Spoken Language Identification
Khe Chai Sim, Institute for Infocomm Research
Haizhou Li, Institute for Infocomm Research
This paper investigates combining contrastive acoustic models for parallel phonotactic language identification systems. Phone Recognition followed by Language Modelling (PRLM), a typical phonotactic system, uses a phone recogniser to extract phonotactic information from the speech data. Combining multiple PRLM systems forms a Parallel PRLM (PPRLM) system. A standard PPRLM system utilises multiple phone recognisers trained on different languages and phone sets to provide diversification. In this paper, a new approach for PPRLM is proposed in which phone recognisers with different acoustic models are used for the parallel systems. The STC and SPAM precision matrix modelling schemes as well as the MMI training criterion are used to produce contrastive acoustic models. Preliminary experimental results are reported on the NIST language recognition evaluation sets. With only two training corpora, a 12-way PPRLM system, using different acoustic modelling schemes, outperformed the standard 2-way PPRLM system by 2.0-5.0% absolute EER.
TuC.P2a‑3
Multi-Layer Kohonen Self-Organizing Feature Map for Language Identification
Liang Wang, School of EE&Telecom, the University of New South Wales, Australia
Eliathamby Ambikairajah, School of EE&Telecom, the University of New South Wales, Australia
Eric H.C. Choi, ATP Research Laboratory, National ICT Australia, Australia
In this paper we describe a novel use of a multi-layer Kohonen self-organizing feature map (MLKSFM) for spoken language identification (LID). A normalized, segment-based input feature vector is used in order to maintain the temporal information of the speech signal. LID is performed using different system configurations of the MLKSFM. Compared with a baseline PPRLM system, our novel system is capable of achieving a similar identification rate, but requires less training time and no phone labeling of training data. The MLKSFM with the sheet-shaped map and the hexagonal-lattice neighborhood relationship is found to give the best performance for the LID task, and this system is able to achieve LID rates of 76.4% and 62.4% for the 45-sec and 10-sec OGI speech utterances, respectively.
TuC.P2a‑4
Hierarchical Language Identification based on Automatic Language Clustering
Bo Yin, The University of New South Wales
Eliathamby Ambikairajah, The University of New South Wales
Fang Chen, National ICT Australia (NICTA)
Due to the limitation of single-level classification, existing fusion techniques experience difficulty in improving the performance of language identification when the number of languages and features is further increased. Given that the similarity of feature distribution between different languages may vary, we propose a novel hierarchical language identification framework with multi-level classification. In this approach, target languages are hierarchically clustered into groups according to the distance between them, models are trained both for individual languages and for language groups, and classification is done hierarchically at multiple levels. This framework is implemented and evaluated in this paper; the results show a relative 15.1% error-rate improvement in the 30 s case on the OGI 10-language database compared to a modern GMM fusion system.
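The grouping step described in the abstract above can be illustrated with a minimal sketch of agglomerative clustering. The abstract does not state which distance measure or linkage rule the authors use, so the average-linkage rule, the function names, and the toy distance values below are all assumptions made for illustration only.

```python
from itertools import combinations

def average_linkage(a, b, dist):
    """Mean pairwise distance between two disjoint groups of languages."""
    return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

def cluster_languages(languages, dist, n_groups):
    """Repeatedly merge the two closest groups until n_groups remain."""
    groups = [(lang,) for lang in languages]
    while len(groups) > n_groups:
        a, b = min(combinations(groups, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], dist))
        groups = [g for g in groups if g not in (a, b)] + [tuple(sorted(a + b))]
    return sorted(groups)

# Hypothetical pairwise distances, invented purely for illustration.
toy_dist = {
    frozenset(("lang_a", "lang_b")): 1.0,
    frozenset(("lang_a", "lang_c")): 4.0,
    frozenset(("lang_a", "lang_d")): 5.0,
    frozenset(("lang_b", "lang_c")): 4.5,
    frozenset(("lang_b", "lang_d")): 5.5,
    frozenset(("lang_c", "lang_d")): 2.0,
}
groups = cluster_languages(["lang_a", "lang_b", "lang_c", "lang_d"], toy_dist, 2)
print(groups)
```

In a hierarchical classifier of this kind, a first-level model would choose among the resulting groups and a second-level model would choose among the languages inside the selected group.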
TuC.P2a‑5
Using Speech Rhythm for Acoustic Language Identification
Ekaterina Timoshenko, Siemens AG
Harald Hoege, Siemens AG
This paper presents results on using rhythm for automatic language identification (LID). The idea is to explore the duration of pseudo-syllables as a language-discriminative feature. The resulting Rhythm system is based on bigram duration models of neighbouring pseudo-syllables. The Rhythm system is fused with a Spectral system realized by the Parallel Phoneme Recognition (PPR) approach using MFCCs. The LID systems were evaluated on a 7-language identification task using the SpeechDat II databases. Tests were performed with 7-second utterances. Whereas the Spectral system, acting as a baseline, achieved an error rate of 7.9%, the fused system reduced the error rate by 10% relative.
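As a rough illustration of what a bigram duration model over neighbouring pseudo-syllables might look like, the sketch below quantizes durations into bins and estimates smoothed transition probabilities between consecutive bins. The abstract does not specify how durations are modelled, so the binning (50 ms bins), the add-one smoothing, the function names, and the toy training data are all assumptions, not the authors' method.

```python
import math
from collections import Counter

N_BINS = 8  # assumed number of duration bins

def quantize(duration_ms, bin_width_ms=50):
    """Map a duration in milliseconds to a bin index; the last bin is open-ended."""
    return min(duration_ms // bin_width_ms, N_BINS - 1)

def train_bigram(sequences):
    """Return an add-one-smoothed estimate of P(current bin | previous bin)."""
    pair_counts, context_counts = Counter(), Counter()
    for seq in sequences:
        bins = [quantize(d) for d in seq]
        for prev, cur in zip(bins, bins[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1

    def prob(prev, cur):
        return (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + N_BINS)

    return prob

def log_likelihood(seq, prob):
    """Log-probability of a duration sequence under one bigram model."""
    bins = [quantize(d) for d in seq]
    return sum(math.log(prob(p, c)) for p, c in zip(bins, bins[1:]))

def identify(seq, models):
    """Pick the language whose duration model scores the sequence highest."""
    return max(models, key=lambda lang: log_likelihood(seq, models[lang]))

# Toy pseudo-syllable durations in ms, invented for illustration only.
models = {
    "lang_x": train_bigram([[100, 300, 100, 300, 100, 300]]),  # alternating
    "lang_y": train_bigram([[200, 200, 200, 200, 200, 200]]),  # even
}
print(identify([100, 300, 100, 300], models))
```

A score of this kind could then be combined with a spectral system's score, for example by a weighted sum; the fusion rule is likewise not given in the abstract.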
TuC.P2a‑6
A Model-based Estimation of Phonotactic Language Verification Performance
Ka-Keung Wong, Hong Kong University of Science and Technology
Man-hung Siu, Hong Kong University of Science and Technology
Brian Mak, Hong Kong University of Science and Technology
One of the most common approaches in language verification (LV) is phonotactic language verification. Currently, LV performance for different languages under different environments and durations has to be compared experimentally, which can make it difficult to understand LV performance across corpora or durations. LV can be viewed as a special case of hypothesis testing, such that the Neyman-Pearson theorem and other information-theoretic analyses are applicable. In this paper, we introduce a measure of phonotactic confusability based on the phonotactic distribution, which makes it possible to assess the difficulty of the verification problem analytically. We then propose a method of predicting LV performance. The effectiveness of the proposed approach is demonstrated on the NIST 2003 language recognition evaluation test set.
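The hypothesis-testing view mentioned in the abstract above can be sketched with a log-likelihood-ratio test between a target-language phone distribution and a background distribution. The expected per-token log-likelihood ratio under the target equals the Kullback-Leibler divergence between the two distributions, which is one standard way to quantify how separable they are. The abstract does not define the authors' confusability measure, so this is only an analogous illustration; the unigram simplification, the function names, and the toy probabilities are assumptions.

```python
import math

def log_likelihood_ratio(tokens, target, background):
    """Sum of log P_target(t) / P_background(t) over the decoded phone tokens."""
    return sum(math.log(target[t] / background[t]) for t in tokens)

def kl_divergence(p, q):
    """KL(p || q): the expected per-token log-likelihood ratio when p is true."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

def verify(tokens, target, background, threshold=0.0):
    """Accept the target-language hypothesis if the ratio exceeds the threshold."""
    return log_likelihood_ratio(tokens, target, background) > threshold

# Toy phone distributions, invented for illustration only.
target = {"a": 0.5, "b": 0.3, "c": 0.2}
background = {"a": 0.2, "b": 0.3, "c": 0.5}

print(verify(["a", "a", "b", "c"], target, background))
print(kl_divergence(target, background))
```

A small divergence means the two distributions are easily confused, so more tokens (longer utterances) are needed to reach the same verification accuracy.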
TuC.P2a‑7
A Tagging Algorithm for Mixed Language Identification in a Noisy Domain
Mike Rosner, University of Malta, Malta
Paulseph-John Farrugia, MobIsle Communications Ltd, Malta
The bilingual nature of the Maltese Islands gives rise to frequent occurrences of code switching, both verbally and in writing. In designing a polyglot TTS system capable of handling SMS messages within the local context, it was necessary to come up with a pre-processing mechanism for identifying the language of origin of individual word tokens. Given that certain common words can be interlingually ambiguous and that the domain under consideration is both open and prone to word contractions and spelling mistakes, the task is not as straightforward as it may seem at first. In this paper we discuss a language-neutral language identification approach capable of handling the characteristics of the domain in a robust fashion.
TuC.P2a‑8
Improved Language Recognition using Better Phonetic Decoders and Fusion with MFCC and SDC Features
Doroteo T Toledano, ATVS Biometric Recognition Group
Javier Gonzalez-Dominguez, ATVS Biometric Recognition Group
Danilo Spada, ATVS Biometric Recognition Group
Alejandro Abejon-Gonzalez, ATVS Biometric Recognition Group
Ismael Mateos-Garcia, ATVS Biometric Recognition Group
Joaquin Gonzalez-Rodriguez, ATVS Biometric Recognition Group
One of the most popular and best-performing approaches to language recognition (LR) is Parallel Phonetic Recognition followed by Language Modeling (PPRLM). In this paper we report several improvements in our PPRLM system that allowed us to move from an Equal Error Rate (EER) of over 15% to less than 8% on NIST LR Evaluation 2005 data, still using a standard PPRLM system. The most successful improvement was the retraining of the phonetic decoders on larger and more appropriate corpora. We have also developed a new system based on Support Vector Machines (SVMs) that uses as features both Mel Frequency Cepstral Coefficients (MFCCs) and Shifted Delta Cepstra (SDC). This new SVM system alone gives an EER of 10.5% on NIST LRE 2005 data. Fusing our PPRLM system and the new SVM system, we achieve an EER of 5.43% on NIST LRE 2005 data, a relative reduction of almost 66% from our baseline system.
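The relative-reduction figure in the abstract above can be checked with a short calculation. The abstract gives the baseline only as "over 15%", so the exact baseline EER is not known; the sketch below shows the reduction from exactly 15% as a lower bound and the baseline that a 66% reduction to 5.43% would imply. The function names are illustrative.

```python
def relative_reduction(baseline, improved):
    """Fraction of the baseline error rate removed by the improved system."""
    return (baseline - improved) / baseline

def implied_baseline(improved, rel_reduction):
    """Baseline error rate implied by an improved rate and a relative reduction."""
    return improved / (1.0 - rel_reduction)

# Lower bound: if the baseline were exactly 15% EER.
print(round(relative_reduction(15.0, 5.43), 3))   # 0.638

# Baseline implied by a 66% relative reduction down to 5.43% EER.
print(round(implied_baseline(5.43, 0.66), 2))     # 15.97
```

An implied baseline just under 16% is consistent with both "over 15%" and "almost 66%" as stated in the abstract.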