Interspeech 2007 Session ThC.P3: Improved acoustic modeling for ASR
Thursday, August 30, 2007
13:30 – 15:30
James Glass (MIT)
Improved HMM/SVM Methods for Automatic Phoneme Segmentation
Jen-Wei Kuo, Institute of Information Science, Academia Sinica
Hung-Yi Lo, Institute of Information Science, Academia Sinica
Hsin-Min Wang, Institute of Information Science, Academia Sinica
This paper presents improved HMM/SVM methods for a two-stage phoneme segmentation framework, which tries to imitate the human phoneme segmentation process. The first stage performs hidden Markov model (HMM) forced alignment according to the minimum boundary error (MBE) criterion. The objective is to align a phoneme sequence of a speech utterance with its acoustic signal counterpart based on MBE-trained HMMs and explicit phoneme duration models. The second stage uses the support vector machine (SVM) method to refine the hypothesized phoneme boundaries derived by HMM-based forced alignment. The efficacy of the proposed framework has been validated on two speech databases: the TIMIT English database and the MATBN Mandarin Chinese database.
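The first-stage forced alignment described above is, at its core, a dynamic-programming search for the best monotonic assignment of phones to frames. A minimal illustrative sketch, assuming only a matrix of per-frame phone log-likelihoods (the paper's MBE-trained HMMs, duration models, and SVM refinement stage are not reproduced here):

```python
import numpy as np

def align(logp):
    """Align a phone sequence to frames by dynamic programming.

    logp[t, i] = log-likelihood of frame t under phone i; phones must be
    visited in order, each covering at least one frame.  Returns the frame
    index at which each phone starts (a crude phoneme segmentation).
    """
    T, N = logp.shape
    D = np.full((T, N), -np.inf)       # best score ending at (frame, phone)
    back = np.zeros((T, N), dtype=int)  # phone index of the predecessor frame
    D[0, 0] = logp[0, 0]
    for t in range(1, T):
        for i in range(N):
            stay = D[t - 1, i]
            enter = D[t - 1, i - 1] if i > 0 else -np.inf
            if stay >= enter:
                D[t, i] = stay + logp[t, i]
                back[t, i] = i
            else:
                D[t, i] = enter + logp[t, i]
                back[t, i] = i - 1
    # trace back the phone start frames
    starts = [0] * N
    i = N - 1
    for t in range(T - 1, 0, -1):
        j = back[t, i]
        if j != i:
            starts[i] = t
            i = j
    return starts

# toy example: phone 0 matches frames 0-2, phone 1 matches frames 3-5
logp = np.full((6, 2), -5.0)
logp[:3, 0] = 0.0
logp[3:, 1] = 0.0
starts = align(logp)
```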
Gaussian Mixture Optimization for HMM based on Efficient Cross-validation
Takahiro Shinozaki, Kyoto University
Tatsuya Kawahara, Kyoto University
A Gaussian mixture optimization method is explored using the cross-validation likelihood as an objective function instead of the conventional training-set likelihood. The optimization reduces the number of mixture components by selecting and merging pairs of Gaussians step by step based on the objective function, so as to remove redundant components and improve the generality of the model. The cross-validation likelihood is more appropriate for avoiding over-fitting than the conventional likelihood and can be computed efficiently using sufficient statistics. It results in better Gaussian pair selection and provides a termination criterion that does not rely on empirical thresholds. Large-vocabulary speech recognition experiments on oral presentations show that the cross-validation method gives a smaller word error rate, with an automatically determined model size, than a baseline training procedure that does not perform the optimization.
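The merge step in such a procedure is typically a moment-matched combination of two weighted Gaussians. A minimal sketch for the diagonal-covariance case (illustrative only; the paper's cross-validation-based pair selection and termination criterion are not shown):

```python
import numpy as np

def merge_gaussians(w1, mu1, var1, w2, mu2, var2):
    """Moment-matched merge of two weighted diagonal Gaussians.

    The merged component preserves the total weight, mean, and variance
    (first and second moments) of the pair it replaces.
    """
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    var = (w1 * (var1 + mu1 ** 2) + w2 * (var2 + mu2 ** 2)) / w - mu ** 2
    return w, mu, var

# merging two unit-variance components at -1 and +1 yields one at 0
w, mu, var = merge_gaussians(0.5, np.array([-1.0]), np.array([1.0]),
                             0.5, np.array([1.0]), np.array([1.0]))
```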
Model-Space MLLR for Trajectory HMMs
Heiga Zen, Nagoya Institute of Technology
Yoshihiko Nankaku, Nagoya Institute of Technology
Keiichi Tokuda, Nagoya Institute of Technology
This paper proposes a model-space Maximum Likelihood Linear Regression (mMLLR) based speaker adaptation technique for trajectory HMMs, which are derived from HMMs by imposing explicit relationships between static and dynamic features. This model can alleviate two limitations of the HMM, namely the constant statistics within a state and the conditional independence assumption of state output probabilities, without increasing the number of model parameters. Results of continuous speech recognition experiments show that the proposed algorithm can adapt trajectory HMMs to a specific speaker and improve the performance of a trajectory HMM-based speech recognition system.
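For orientation, the core of MLLR-style mean adaptation is an affine transform shared across many Gaussians; a minimal sketch of that shared transform (the trajectory-HMM-specific derivation in the paper is not reproduced):

```python
import numpy as np

def mllr_adapt_mean(mu, A, b):
    """MLLR mean adaptation: mu' = A @ mu + b.

    A single regression matrix A and bias b are estimated from adaptation
    data and applied to the means of all Gaussians in a regression class.
    """
    return A @ mu + b

# toy transform: scale the mean by 2 and shift by 0.5 in each dimension
mu_new = mllr_adapt_mean(np.array([1.0, 2.0]),
                         np.eye(2) * 2.0,
                         np.array([0.5, 0.5]))
```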
In-Context Phone Posteriors as Complementary Features for Tandem ASR
Hamed Ketabdar, IDIAP Research Institute, Martigny, Switzerland, and Swiss Federal Institute of Technology at Lausanne (EPFL)
Herve Bourlard, IDIAP Research Institute, Martigny, Switzerland, and Swiss Federal Institute of Technology at Lausanne (EPFL)
We present a method for integrating prior knowledge (such as phonetic and lexical knowledge), as well as long acoustic context, into phone posterior estimation, and we propose to use the obtained posteriors as complementary posterior features in a Tandem ASR configuration. These posteriors are estimated based on the HMM state posterior probability definition (typically used in standard HMM training). In this way, by integrating the appropriate prior knowledge and context, we enhance the estimation of phone posteriors. These new posteriors are called 'in-context' or HMM posteriors. We combine them as complementary evidence with the posteriors estimated by a Multi-Layer Perceptron (MLP), and use the combined evidence as features for training and inference in the Tandem configuration. This approach improves performance, compared to using only the MLP-estimated posteriors as Tandem features, on the OGI Numbers, Conversational Telephone Speech (CTS), and Wall Street Journal (WSJ) databases.
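One common way to fuse two posterior streams as complementary evidence is a normalised log-linear (geometric-mean) combination. A hedged sketch of that rule, not necessarily the exact combination used in the paper:

```python
import numpy as np

def combine_posteriors(p_mlp, p_hmm, eps=1e-10):
    """Fuse two phone-posterior vectors by an equal-weight log-linear
    (geometric-mean) rule and renormalise to a valid distribution."""
    log_comb = 0.5 * (np.log(p_mlp + eps) + np.log(p_hmm + eps))
    p = np.exp(log_comb)
    return p / p.sum(axis=-1, keepdims=True)

# two streams agreeing on class 0 with different sharpness
p = combine_posteriors(np.array([0.7, 0.3]), np.array([0.5, 0.5]))
```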
Phone-Discriminating Minimum Classification Error (P-MCE) Training for Phonetic Recognition
Qiang Fu, Georgia Institute of Technology
Xiaodong He, Microsoft Research
Li Deng, Microsoft Research
In this paper, we report a study comparing the performance of discriminative training methods for phone recognition on the TIMIT database. We propose a new method, phone-discriminating minimum classification error (P-MCE), which performs MCE training at the sub-string or phone level instead of at the traditional string level. While aiming to minimize the phone recognition error rate, P-MCE nevertheless takes advantage of the well-known, efficient training routine derived from conventional string-based MCE, using specially constructed one-best lists selected from phone lattices. Extensive investigations and comparisons are conducted between P-MCE and other discriminative training methods, including maximum mutual information (MMI), minimum phone or word error (MPE/MWE), and two other MCE methods. P-MCE outperforms most of the approaches examined on the TIMIT database for continuous phone recognition, while achieving results comparable to those of the MPE method.
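The string-level MCE machinery that P-MCE builds on combines a misclassification measure (reference score against competitor scores) with a sigmoid smoothing function. An illustrative sketch of that loss; the phone-level construction from one-best lists is not shown:

```python
import numpy as np

def mce_loss(logp_ref, logp_competitors, gamma=1.0):
    """Sigmoid-smoothed MCE loss for one utterance.

    d = -log p(reference) + log-sum-exp over competitor scores;
    the sigmoid turns d into a smooth, differentiable 0/1 error count.
    """
    d = -logp_ref + np.logaddexp.reduce(logp_competitors)
    return 1.0 / (1.0 + np.exp(-gamma * d))
```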
Improved Acoustic Modeling for Transcribing Arabic Broadcast Data
Lori Lamel, CNRS-LIMSI
Abdel. Messaoudi, CNRS-LIMSI
Jean-Luc Gauvain, CNRS-LIMSI
This paper summarizes our recent progress in improving the automatic transcription of Arabic broadcast audio data, and some efforts to address the challenges of broadcast conversational speech. Our efforts are aimed at improving the acoustic, pronunciation, and language models, taking into account the specificities of the Arabic language. In previous work we demonstrated that explicit modeling of short vowels improved recognition performance, even when producing non-vocalized hypotheses. In addition to modeling short vowels, consonant gemination and nunation are now explicitly modeled, alternative pronunciations have been introduced to better represent dialectal variants, and a duration model has been integrated. In order to facilitate training on Arabic audio data with non-vocalized transcripts, a generic vowel model has been introduced. Compared with the previous system (used in the 2006 GALE evaluation), the relative word error rate has been reduced by over 10%.
String-based and Lattice-based Discriminative Training for the Corpus of Spontaneous Japanese Lecture Transcription Task
Erik McDermott, NTT Corporation
Atsushi Nakamura, NTT Corporation
This article aims to provide a comprehensive set of acoustic model discriminative training results for the Corpus of Spontaneous Japanese (CSJ) lecture speech transcription task. Discriminative training was carried out for this task using a 100,000-word trigram language model, for several acoustic model topologies, for both diagonal and full covariance models, and using both string-based and lattice-based training paradigms. We describe our implementation of the proposal by Macherey et al. for numerical subtraction of the reference lattice statistics from the competitor lattice during lattice-based Minimum Classification Error (MCE) training. We also present results for lattice-based training without such subtraction, corresponding to the well-known Maximum Mutual Information (MMI) approach. MCE/MMI training yielded relative reductions in word error rate of up to 13%. Issues specific to discriminative training for this task are discussed.
Discriminative Noise Adaptive Training Approach for an Environment Migration
Byung-Ok Kang, ETRI, Korea
Ho-Young Jung, ETRI, Korea
Yun-Keun Lee, ETRI, Korea
A combined strategy of noise-adaptive training (NAT) and discriminative adaptation is proposed for the effective migration of speech recognition systems to new noisy environments. NAT is an effective approach for real-field applications, but it does not satisfy the minimum classification error (MCE) criterion in the recognition process and adapts poorly to new environments. The proposed method compensates for these weak points with discriminative adaptation strategies and presents a new way of improving the MCE approach. Experimental results show that, with this new method, a speech recognition system can successfully be migrated to other environments using condition-specific data from the target environment.
Word Confusability - Measuring Hidden Markov Model Similarity
Jia-Yu Chen, Stanford University
Peder Olsen, IBM
John Hershey, IBM
We address the problem of word confusability in speech recognition by measuring the similarity between Hidden Markov Models (HMMs) using a number of recently developed techniques. The focus is on defining a word confusability measure that is accurate, in the sense of predicting artificial speech recognition errors, and computationally efficient when applied to speech recognition applications. Using an edit distance framework for HMMs, we show that statistical information measures of the distance between probability distribution functions can be used to define similarity or distance measures between HMMs. We use the correlation between the errors of a real speech recognizer and the HMM similarities to measure how well each technique works. We demonstrate significant improvements relative to traditional phone-confusion-weighted edit distance measures by using a Bhattacharyya-divergence-based edit distance.
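For diagonal-covariance Gaussians, the Bhattacharyya distance has a simple closed form. A sketch of that per-state distance (the edit-distance combination over HMM state sequences described in the paper is not shown):

```python
import numpy as np

def bhattacharyya(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    D_B = 1/8 (mu1-mu2)^T Sigma^{-1} (mu1-mu2)
        + 1/2 ln( det Sigma / sqrt(det Sigma1 * det Sigma2) ),
    with Sigma = (Sigma1 + Sigma2) / 2, computed per dimension here.
    """
    v = 0.5 * (var1 + var2)
    term_mean = 0.125 * np.sum((mu1 - mu2) ** 2 / v)
    term_cov = 0.5 * np.sum(np.log(v / np.sqrt(var1 * var2)))
    return term_mean + term_cov
```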
Speech Recognition with State-based Nearest Neighbour Classifiers
Thomas Deselaers, RWTH Aachen University
Georg Heigold, RWTH Aachen University
Hermann Ney, RWTH Aachen University
We present a system that uses nearest-neighbour classification at the state level of the hidden Markov model. Current speech recognition systems commonly use Gaussian mixtures with a very large number of densities. We propose to carry this idea to the extreme, such that each observation is a prototype of its own. This approach is well known and widely used in other areas of pattern recognition and has some immediate advantages over other classification approaches, but it has never been applied to speech recognition. We evaluate the proposed method on the SieTill corpus of continuous digit strings and on the large-vocabulary EPPS English task. It is shown that nearest neighbour outperforms conventional systems when training data is sparse.
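The state-level nearest-neighbour idea can be sketched as scoring a frame by its distance to the closest stored prototype of each state, with every training observation serving as its own prototype. An illustrative toy version:

```python
import numpy as np

def nn_state_score(x, prototypes):
    """Score a frame x against one HMM state: negative squared distance
    to the nearest stored prototype (higher is better, like a log-likelihood)."""
    d = np.sum((prototypes - x) ** 2, axis=1)
    return -np.min(d)

# two states, each represented by its raw training observations
protos_a = np.zeros((3, 2))          # state A: frames near the origin
protos_b = np.full((3, 2), 5.0)      # state B: frames near (5, 5)
x = np.array([0.1, 0.0])             # a new frame close to state A
```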
HMM-based Speech Recognition Using Decision Trees Instead of GMMs
Remco Teunen, Toshiba Corporate Research & Development Center
Masami Akamine, Toshiba Corporate Research & Development Center
In this paper, we experiment with decision trees as replacements for Gaussian mixture models to compute the observation likelihoods for a given HMM state in a speech recognition system. Decision trees have a number of advantageous properties: they do not impose restrictions on the number or types of features, and they automatically perform feature selection. In fact, due to the conditional nature of the decision tree evaluation process, the subset of features that is actually used during recognition depends on the input signal. Automatic state-tying can be incorporated directly into the acoustic model as well, and it too becomes a function of the input signal. Experimental results on the Aurora 2 speech database show that a system using decision trees offers state-of-the-art performance, even without taking advantage of its full potential.
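The input-dependent feature use described above can be illustrated with a toy tree: only the features along the traversed path are ever read. This is a sketch with invented feature names, not the paper's tree-growing procedure:

```python
def tree_loglik(frame):
    """Toy decision tree returning a log-likelihood for one HMM state.

    frame is a dict of feature values; note that 'c1' is only inspected
    when the first split on 'energy' is taken, so low-energy frames never
    need it at all (conditional feature use).
    """
    if frame["energy"] > 0.5:        # first split: overall energy
        return -1.0 if frame["c1"] > 0.0 else -2.5
    return -4.0                      # low-energy leaf: 'c1' never read
```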
An Improved Method for Unsupervised Training of LVCSR Systems
Christian Gollan, Lehrstuhl für Informatik 6, Computer Science Dept., RWTH Aachen University, Aachen, Germany
Stefan Hahn, Lehrstuhl für Informatik 6, Computer Science Dept., RWTH Aachen University, Aachen, Germany
Ralf Schlüter, Lehrstuhl für Informatik 6, Computer Science Dept., RWTH Aachen University, Aachen, Germany
Hermann Ney, Lehrstuhl für Informatik 6, Computer Science Dept., RWTH Aachen University, Aachen, Germany
In this paper, we introduce an improved method for unsupervised training in which the data selection or filtering process is performed at the state level. We describe the experimental setup in detail and introduce confidence scores at the word and allophone-state levels for performing data selection for mixture training at the state level. Although we use a relatively small amount of 180 hours of untranscribed recordings in addition to the available 100 hours of carefully manually transcribed data, we are able to significantly improve our final speaker-adaptive acoustic model. Furthermore, we present promising results from system combination using acoustic models trained with different confidence thresholds. These methods are evaluated on the EPPS corpus, starting from the RWTH European English parliamentary speech transcription system. A significant relative improvement of 7% is achieved using less data for unsupervised training than conventional systems require.
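At its simplest, the state-level data selection amounts to keeping only automatically transcribed material whose confidence exceeds a threshold. A toy sketch (the confidence estimation itself is not shown, and the threshold value here is purely illustrative):

```python
def select_frames(frames, confidences, threshold=0.7):
    """Keep only the automatically labelled frames whose state-level
    confidence score exceeds the threshold; the rest are discarded
    from mixture training."""
    return [f for f, c in zip(frames, confidences) if c > threshold]

# frames "a" and "c" survive; "b" is filtered out as unreliable
kept = select_frames(["a", "b", "c"], [0.9, 0.5, 0.8])
```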
A Variational Approach to Robust Maximum Likelihood Estimation for Speech Recognition
Mohamed Omar, IBM T. J. Watson Research Center
In many automatic speech recognition (ASR) applications, the data used to estimate the class-conditional feature probability density function (PDF) is noisy, and the test data is mismatched with the training data. This paper addresses the degradation in the performance of ASR systems due to small perturbations of the training data. To approach this problem, we provide a computationally efficient algorithm for estimating the model parameters which maximize the sum of the log likelihood and the negative of a measure of the sensitivity of the estimated likelihood to these perturbations; this approach does not make any assumptions about the noise model during training and testing. We present several large vocabulary speech recognition experiments that show significant recognition accuracy improvement compared to using the baseline maximum likelihood models.
Generating Small, Accurate Acoustic Models with a Modified Bayesian Information Criterion
Kai Yu, Carnegie Mellon University
Rob Rutenbar, Carnegie Mellon University
Although Gaussian mixture models are commonly used in acoustic models for speech recognition, there is no standard method for determining the number of mixture components. Most models arbitrarily assign the number of mixture components with little justification. While model selection techniques with a mathematical derivation, such as the Bayesian information criterion (BIC), have been applied, these criteria focus on properly modeling the true distribution of individual tied-states (senones) without considering the entire acoustic model; this leads to suboptimal speech recognition performance. In this paper we present a method to generate statistically-justified acoustic models that consider inter-senone effects by modifying the BIC. Experimental results in the CMU Communicator domain show that in contrast to previous strategies, the new method generates not only attractively smaller acoustic models, but also ones with lower word error rate.
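For orientation, the standard BIC trade-off between likelihood and model size can be written as a one-line score; a sketch of the unmodified criterion (the paper's modification for inter-senone effects is not reproduced):

```python
import numpy as np

def bic(log_likelihood, n_params, n_frames):
    """Bayesian information criterion (higher is better in this form):
    the data log-likelihood penalised by half the parameter count times
    log of the sample size."""
    return log_likelihood - 0.5 * n_params * np.log(n_frames)

# with equal likelihood, the smaller candidate mixture wins
score_small = bic(-1000.0, 10, 100)
score_large = bic(-1000.0, 20, 100)
```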
Sparse Gaussian Graphical Models for Speech Recognition
Peter John Bell, CSTR, University of Edinburgh
Simon King, CSTR, University of Edinburgh
We address the problem of learning the structure of Gaussian graphical models for use in automatic speech recognition, a means of controlling the form of the inverse covariance matrices of such systems. With particular focus on data sparsity issues, we implement a method for imposing graphical model structure on a Gaussian mixture system, using a convex optimisation technique to maximise a penalised likelihood expression. The results of initial experiments on a phone recognition task show a performance improvement over an equivalent full-covariance system.
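The penalised likelihood maximised in this kind of structure learning is typically log det P - tr(SP) - rho * ||P||_1 over the off-diagonal entries of the precision matrix P, given a sample covariance S. A sketch of evaluating that objective (not of the convex solver itself, and not necessarily the exact penalty used in the paper):

```python
import numpy as np

def penalised_loglik(precision, sample_cov, rho):
    """Penalised Gaussian log-likelihood (up to constants):
    log det(P) - tr(S P) - rho * sum of |off-diagonal entries of P|.
    Larger rho drives the learned precision matrix towards sparsity,
    i.e. towards a sparser graphical model structure."""
    sign, logdet = np.linalg.slogdet(precision)  # assumes P is pos. definite
    off_diag_l1 = np.sum(np.abs(precision)) - np.trace(np.abs(precision))
    return logdet - np.trace(sample_cov @ precision) - rho * off_diag_l1

# identity precision against identity sample covariance in 2-D:
# log det = 0, trace = 2, no off-diagonal penalty, so the value is -2
val = penalised_loglik(np.eye(2), np.eye(2), rho=0.1)
```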
An HMM Acoustic Model Incorporating Various Additional Knowledge Sources
Sakriani Sakti, NICT / ATR SLC Labs
Konstantin Markov, NICT / ATR SLC Labs
Satoshi Nakamura, NICT / ATR SLC Labs
We introduce a method of incorporating additional knowledge sources into an HMM-based statistical acoustic model. To achieve easy integration of additional knowledge sources from any domain, the probabilistic relationship between the information sources is first learned through a Bayesian network (BN), and then the global joint probability density function (PDF) of the model is formulated. If the model becomes too complex and direct BN inference is intractable, we use the junction tree algorithm to decompose the global joint PDF into a linked set of local conditional PDFs. In this way, a simplified form of the model can be constructed and reliably estimated using limited training data. Here, we apply this framework to incorporate accent, gender, and wide-phonetic knowledge information at the HMM phonetic model level. Experimental results show that our method improves word accuracy with respect to the standard HMM.
Comparison of Subspace Methods for Gaussian Mixture Models in Speech Recognition
Matti Varjokallio, Helsinki University of Technology
Mikko Kurimo, Helsinki University of Technology
Speech recognizers typically use high-dimensional feature vectors to capture the essential cues for speech recognition purposes. The acoustics are then commonly modeled with a hidden Markov model with Gaussian mixture models as observation probability density functions. Using unrestricted Gaussian parameters may lead to intolerable model costs, in terms of both evaluation and storage, which limits their practical use to high-end systems. The classical approach to tackling these problems is to assume independent features and constrain the covariance matrices to be diagonal. This can be thought of as constraining the second-order parameters to lie in a fixed subspace consisting of rank-1 terms. In this paper we discuss the differences between recently proposed subspace methods for GMMs, with emphasis placed on the applicability of the models to a practical LVCSR system.
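The cost argument above is easy to quantify: a full covariance needs d(d+1)/2 parameters per Gaussian against d for a diagonal one. A toy parameter count (dimension 39 is assumed here as a typical MFCC front end with delta and delta-delta features):

```python
def gaussian_param_count(dim, diagonal=True):
    """Free parameters in one Gaussian: dim for the mean, plus either
    dim (diagonal covariance) or dim*(dim+1)/2 (full covariance)."""
    cov = dim if diagonal else dim * (dim + 1) // 2
    return dim + cov

# per-Gaussian cost at a typical 39-dimensional front end
diag_cost = gaussian_param_count(39, diagonal=True)    # 39 + 39
full_cost = gaussian_param_count(39, diagonal=False)   # 39 + 780
```

Subspace methods for GMMs aim at intermediate points between these two extremes, trading a modest parameter increase over the diagonal case for better covariance modelling.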