Interspeech 2007 Session TuC.P3b: Adaptation in ASR I
Tuesday, August 28, 2007
13:30 – 15:30
Lin-shan Lee (National Taiwan University)
Clustered Maximum Likelihood Linear Basis for Rapid Speaker Adaptation
Yun Tang, Department of Electrical and Computer Engineering, McGill University
Richard Rose, Department of Electrical and Computer Engineering, McGill University
Speaker space based adaptation methods for automatic speech recognition have been shown to provide significant performance improvements for tasks where only a few seconds of adaptation speech are available. This paper proposes a robust, low-complexity technique within this general class that has been shown to reduce word error rate, reduce the large storage requirements associated with speaker space approaches, and eliminate the need for large numbers of utterances per speaker in training. The technique is based on representing speakers as a linear combination of clustered linear basis vectors, and a procedure is presented for ML estimation of these vectors from training data. Significant word error rate reduction was obtained relative to speaker independent performance for the Resource Management and Wall Street Journal task domains.
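The core idea can be pictured as follows: a new speaker's mean supervector is a linear combination of basis vectors, with weights fit to that speaker's adaptation statistics. The least-squares weight estimate in this sketch is an illustrative stand-in for the paper's clustered ML procedure, and all names are hypothetical:

```python
import numpy as np

def adapt_supervector(basis, target_stats):
    """basis: (K, D) matrix of K basis supervectors.
    target_stats: (D,) observed mean statistics for the new speaker.
    Returns the combination weights and the adapted supervector."""
    # Solve min_w ||basis.T @ w - target_stats||^2
    # (a pseudo-ML estimate assuming unit variances)
    w, *_ = np.linalg.lstsq(basis.T, target_stats, rcond=None)
    return w, basis.T @ w

rng = np.random.default_rng(0)
basis = rng.standard_normal((3, 10))   # 3 basis vectors, 10-dim supervector
true_w = np.array([0.5, -1.0, 2.0])
obs = basis.T @ true_w                 # noiseless adaptation statistics
w, adapted = adapt_supervector(basis, obs)
```

Because the speaker is described by only K weights rather than a full model, very little adaptation speech is needed to estimate it.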
Rapid Speaker Adaptation by Reference Model Interpolation
Wenxuan Teng, TELISMA
Guillaume Gravier, IRISA (CNRS & INRIA) / METISS
Frédéric Bimbot, IRISA (CNRS & INRIA) / METISS
Frédéric Soufflet, TELISMA
We present in this work a novel algorithm for fast speaker adaptation using only small amounts of adaptation data. It is motivated by the fact that a set of representative speakers can provide a priori knowledge to guide the estimation of a new speaker in the speaker-space. The proposed algorithm enables an a posteriori selection of reference models in the speaker-space, as opposed to the a priori selection of a reference speaker-space commonly used in techniques such as Eigenvoices. We compare the proposed algorithm with common rapid adaptation techniques in the context of a phoneme recognition task. Experimental results on the IDIOLOGOS and PAIDIALOGOS corpora show that the proposed algorithm achieves a slightly larger improvement in phoneme accuracy than classic Eigenvoices, especially for atypical speakers such as children.
Rapid Unsupervised Speaker Adaptation Using Single Utterance Based on MLLR and Speaker Selection
Randy Gomez, NAIST
Tomoki Toda, NAIST
Hiroshi Saruwatari, NAIST
Kiyohiro Shikano, NAIST
In this paper, we employ the concept of HMM sufficient statistics (HMM-Suff Stat) and N-best speaker selection to realize a rapid implementation of Baum-Welch and MLLR. Only a single arbitrary utterance is required, which is used to select the N-best speakers' HMM-Suff Stat from the training database as adaptation data. Since the HMM-Suff Stat are pre-computed offline, the computation load is minimized. Moreover, adaptation data from the target speaker is not needed. An absolute improvement of 1.8% word accuracy (WA) is achieved when using rapid Baum-Welch as opposed to the SI model, and a further improvement of 1.1% WA when rapid MLLR is used compared to rapid Baum-Welch adaptation using HMM-Suff Stat. Adaptation times are as fast as 6 sec and 7 sec, respectively. Evaluation is done in noisy environments, with the adaptation algorithm integrated in a speech dialogue system. Additional experiments with VTLN, MAP, and conventional MLLR are performed.
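The selection-and-pooling step can be sketched as below: the single utterance picks the closest training speakers, and their offline-computed statistics are pooled to re-estimate the model. A single Gaussian per speaker stands in for full HMM-Suff Stat, and all names are illustrative:

```python
import numpy as np

def select_and_reestimate(utt_mean, spk_means, spk_stats, n_best=2):
    """utt_mean: (D,) statistics of the single input utterance.
    spk_means: (S, D) one model mean per training speaker.
    spk_stats: per-speaker (occupancy, first-order stat), precomputed offline."""
    d2 = np.sum((spk_means - utt_mean) ** 2, axis=1)  # distance to each speaker
    best = np.argsort(d2)[:n_best]                    # N-best speaker indices
    occ = sum(spk_stats[i][0] for i in best)          # pooled occupancy counts
    first = sum(spk_stats[i][1] for i in best)        # pooled first-order stats
    return best, first / occ                          # re-estimated mean

spk_means = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])
spk_stats = [(4.0, np.array([0.0, 0.0])),
             (4.0, np.array([4.0, 4.0])),
             (4.0, np.array([36.0, 36.0]))]
best, mean = select_and_reestimate(np.array([0.5, 0.5]), spk_means, spk_stats)
```

Because only sums of precomputed statistics are needed at run time, the re-estimation itself is nearly free, which is what makes the few-second adaptation times plausible.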
Robustness of Several Kernel-based Fast Adaptation Methods on Noisy LVCSR
Brian Mak, The Hong Kong University of Science and Technology
Roger Hsiao, Carnegie Mellon University
We have been investigating the use of kernel methods to improve conventional linear adaptation algorithms for fast adaptation, when fewer than 10 seconds of adaptation speech are available. On clean speech, we had shown that our new kernel-based adaptation methods, namely embedded kernel eigenvoice (eKEV) and kernel eigenspace-based MLLR (KEMLLR), outperformed their linear counterparts. In this paper, we study their unsupervised adaptation performance under additive and convolutional noise using the Aurora4 Corpus, with no assumption or prior knowledge of the noise type and its level. It is found that both eKEV and KEMLLR adaptation continue to outperform MAP and MLLR, and the simple reference speaker weighting (RSW) algorithm continues to perform favorably compared with KEMLLR. Furthermore, KEMLLR adaptation gives the greatest overall improvement over the speaker-independent model, by about 19%.
Estimating VTLN Warping Factors by Distribution Matching
Janne Pylkkönen, Helsinki University of Technology
Several methods exist for estimating the warping factors for vocal tract length normalization (VTLN), most of which rely on an exhaustive search over the warping factors to maximize the likelihood of the adaptation data. This paper presents a method for warping factor estimation that is based on matching Gaussian distributions by Kullback-Leibler divergence. It is computationally more efficient than most maximum likelihood methods, but above all it can be used to incorporate the speaker normalization very early in the training process. This can greatly simplify and speed up the training. The estimation method is compared to the baseline maximum likelihood method in three large vocabulary continuous speech recognition tasks. The results confirm that the method performs well in a variety of tasks and configurations.
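Distribution matching of this kind can be sketched with the closed-form KL divergence between Gaussians: each candidate warping factor yields a warped-feature distribution, and the factor whose distribution is closest to the target is chosen. The univariate toy model below is illustrative, not the paper's estimator:

```python
import numpy as np

def kl_gauss(m0, s0, m1, s1):
    # KL( N(m0, s0^2) || N(m1, s1^2) ) in closed form
    return np.log(s1 / s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5

def best_warp(warped_stats, target, warps):
    """warped_stats(alpha) -> (mean, std) of features warped by factor alpha;
    target: (mean, std) of the normalization target distribution."""
    costs = [kl_gauss(*warped_stats(a), *target) for a in warps]
    return warps[int(np.argmin(costs))]

# Toy model: warping shifts the feature mean linearly; target mean is 0,
# so the matching factor should be 1.1.
warps = np.linspace(0.8, 1.2, 9)
stats = lambda a: (5.0 * (a - 1.1), 1.0)
alpha = best_warp(stats, (0.0, 1.0), warps)
```

Evaluating a closed-form divergence per factor avoids re-decoding the adaptation data for every candidate, which is where the efficiency gain over exhaustive likelihood search comes from.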
Frequency Domain Correspondence for Speaker Normalization
Ming Liu, IFP, University of Illinois at Urbana-Champaign
Xi Zhou, IFP, University of Illinois at Urbana-Champaign
Mark Hasegawa-Johnson, University of Illinois at Urbana-Champaign
Zhengyou Zhang, Microsoft Research
Thomas S. Huang, IFP, University of Illinois at Urbana-Champaign
Due to physiological and linguistic differences between speakers, the spectrum pattern for the same phoneme of two speakers can be quite dissimilar. Without appropriate alignment on the frequency axis, this inter-speaker variation will reduce the modeling efficiency and result in performance degradation. In this paper, a novel data-driven framework is proposed to build the alignment of the frequency axes of two speakers. This alignment between two frequency axes is essentially a frequency domain correspondence of these two speakers. To establish the frequency domain correspondence, we formulate the task as an optimal matching problem. The local matching is achieved by comparing the local features of the spectrogram along the frequency bins; it captures the similarity of the local patterns across different frequency bins in the spectrogram. After the local matching, dynamic programming is then applied to find the globally optimal alignment between the two frequency axes. Experiments on TIDIGITS and TIMIT clearly show the effectiveness of this method.
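The global alignment step is structurally a dynamic-programming alignment (DTW-like, but over the frequency axis rather than time). A minimal sketch, with per-bin local features assumed as input and all names illustrative:

```python
import numpy as np

def align_frequency_axes(feat_a, feat_b):
    """feat_a: (Fa, D) local spectrogram features per frequency bin, speaker A;
    feat_b: (Fb, D) same for speaker B. Returns a monotonic bin-to-bin
    alignment path found by dynamic programming."""
    Fa, Fb = len(feat_a), len(feat_b)
    # local matching cost between every pair of frequency bins
    cost = np.linalg.norm(feat_a[:, None] - feat_b[None, :], axis=2)
    acc = np.full((Fa, Fb), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(Fa):                      # accumulate minimal path cost
        for j in range(Fb):
            if i == j == 0:
                continue
            prev = min(acc[i - 1, j] if i else np.inf,
                       acc[i, j - 1] if j else np.inf,
                       acc[i - 1, j - 1] if i and j else np.inf)
            acc[i, j] = cost[i, j] + prev
    # backtrack from the top-right corner to recover the alignment
    i, j = Fa - 1, Fb - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((m for m in moves if m[0] >= 0 and m[1] >= 0),
                   key=lambda m: acc[m])
        path.append((i, j))
    return path[::-1]

feat = np.arange(8.0).reshape(4, 2)          # two identical "speakers"
path = align_frequency_axes(feat, feat)
```

For identical inputs the recovered correspondence is simply the diagonal; for two real speakers the path bends where one speaker's formant structure sits at shifted frequency bins.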
Unsupervised Training of Adaptation Rate Using Q-learning in Large Vocabulary Continuous Speech Recognition
Masafumi Nishida, Chiba University
Yasuo Horiuchi, Chiba University
Akira Ichikawa, Chiba University
This paper describes a novel approach based on unsupervised training of the MAP adaptation rate using Q-learning. Q-learning is a reinforcement learning technique and a computational approach to learning whereby an agent tries to maximize the total reward it receives when interacting with a complex, uncertain environment. The proposed method defines the likelihood of the adapted model as a reward and learns a weight factor that indicates the relative balance between the initial model and the adaptation data, without the need for supervised data. We conducted recognition experiments on lecture speech using a corpus of spontaneous Japanese, and were able to estimate the optimal weight factor using Q-learning in advance. MAP adaptation using the weight factor estimated with the proposed method achieved recognition accuracy equivalent to that of MAP adaptation using a weight factor determined experimentally.
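The reward-driven search for a weight factor can be illustrated with a deliberately simplified, single-state Q-learning loop: candidate weights are the actions, and a stand-in likelihood function is the reward. The paper's agent/environment setup is richer than this sketch, and all names are hypothetical:

```python
import numpy as np

def q_learn_weight(reward_fn, weights, episodes=500, alpha=0.1, eps=0.3):
    """Epsilon-greedy Q-learning over candidate MAP weight factors.
    reward_fn(w) stands in for the likelihood of the model adapted with w."""
    rng = np.random.default_rng(1)
    Q = np.zeros(len(weights))          # one action value per candidate weight
    for _ in range(episodes):
        # explore with probability eps, otherwise act greedily
        a = rng.integers(len(weights)) if rng.random() < eps else int(np.argmax(Q))
        Q[a] += alpha * (reward_fn(weights[a]) - Q[a])   # Q-value update
    return weights[int(np.argmax(Q))]

likelihood = lambda w: -(w - 0.5) ** 2   # toy reward peaked at w = 0.5
weights = [0.0, 0.25, 0.5, 0.75, 1.0]
best = q_learn_weight(likelihood, weights)
```

The point of the construction is that no transcribed adaptation data is needed: the likelihood itself supplies the training signal for the weight factor.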
Application of CMLLR in narrow band wide band adapted systems
Martin Karafiat, FIT VUT Brno
Lukas Burget, FIT VUT Brno
Jan Cernocky, FIT VUT Brno
Thomas Hain, University of Sheffield
The amount of training data has a crucial effect on the accuracy of HMM based meeting recognition systems. Conversational telephone speech matches speech in meetings well; however, it is naturally recorded with low bandwidth. In this paper we present a scheme that allows wide-band meeting data to be transformed into the same space for improved model training. The transformation into a joint space allows simpler and more efficient implementation of joint speaker adaptive training (SAT) as well as adaptation of statistics for heteroscedastic linear discriminant analysis (HLDA). Models are tested on the NIST RT'05 meeting evaluation, where a relative reduction in word error rate of 4% was achieved. With the use of HLDA and SAT the improvement was retained.
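A CMLLR transform acts in feature space as an affine map y = Ax + b. As a hedged illustration of mapping one bandwidth's features into another's space, the sketch below estimates the map by global moment matching; the actual CMLLR transform is estimated by maximizing likelihood, and all names here are illustrative:

```python
import numpy as np

def apply_cmllr(features, A, b):
    # apply the feature-space transform y = A x + b to (T, D) features
    return features @ A.T + b

def moment_matching_transform(src, tgt):
    """Illustrative global affine map taking the source-band feature
    statistics to the target-band statistics."""
    Ls = np.linalg.cholesky(np.cov(src, rowvar=False))
    Lt = np.linalg.cholesky(np.cov(tgt, rowvar=False))
    A = Lt @ np.linalg.inv(Ls)               # whiten source, color to target
    b = tgt.mean(axis=0) - A @ src.mean(axis=0)
    return A, b

rng = np.random.default_rng(0)
wide = rng.standard_normal((500, 3)) * 2.0 + 1.0     # stand-in wide-band features
narrow = rng.standard_normal((500, 3)) * 0.5 - 1.0   # stand-in narrow-band features
A, b = moment_matching_transform(wide, narrow)
mapped = apply_cmllr(wide, A, b)
```

Once both data sources live in one space, a single model set can be trained on the pooled data, which is exactly what makes the joint SAT and HLDA statistics straightforward.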
Fast adaptation of GMM-based compact models
Christophe Lévy, LIA - University of Avignon
Georges Linarès, LIA - University of Avignon
Jean-François Bonastre, LIA - University of Avignon
In this paper, a new strategy for fast adaptation of acoustic models is proposed for embedded speech recognition. It relies on a general GMM, which represents the whole acoustic space, associated with a set of HMM state-dependent probability functions modeled as transformations of this GMM. The work presented here takes advantage of this architecture to propose a fast and efficient way to adapt the acoustic models. The adaptation is performed only on the general GMM, using techniques borrowed from the speaker recognition domain. It does not require state-dependent adaptation data and is very efficient in terms of computational cost. We evaluated our approach on a voice-command task, using a car-based corpus. This adaptation method achieved a relative error-rate decrease of about 10% even when only a small amount of adaptation data is available. The complete system allows a total relative gain of more than 20% compared to a basic HMM-based system.
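A standard speaker-recognition technique for adapting a single GMM is relevance-MAP adaptation of the component means. The sketch below (diagonal covariances, means only, illustrative names) shows the kind of update such an approach involves; it is not claimed to be the paper's exact procedure:

```python
import numpy as np

def map_adapt_means(means, variances, priors, data, tau=10.0):
    """Relevance-MAP adaptation of GMM means.
    means, variances: (M, D); priors: (M,); data: (T, D); tau: relevance factor."""
    # per-component log-likelihoods of each frame (diagonal Gaussians)
    log_p = np.stack([
        np.log(w) - 0.5 * (np.sum((data - m) ** 2 / v, axis=1)
                           + np.sum(np.log(2 * np.pi * v)))
        for m, v, w in zip(means, variances, priors)
    ])                                          # shape (M, T)
    gamma = np.exp(log_p - log_p.max(axis=0))
    gamma /= gamma.sum(axis=0)                  # responsibilities
    n = gamma.sum(axis=1)                       # occupancy per component
    ex = (gamma @ data) / np.maximum(n, 1e-10)[:, None]   # first-order stats
    alpha = (n / (n + tau))[:, None]            # data vs. prior interpolation
    return alpha * ex + (1 - alpha) * means     # adapted means

means = np.array([[0.0, 0.0]])
variances = np.array([[1.0, 1.0]])
priors = np.array([1.0])
data = np.ones((10, 2))                         # ten frames at (1, 1)
adapted = map_adapt_means(means, variances, priors, data, tau=10.0)
```

Because every state-dependent model is a transformation of the one general GMM, adapting that single GMM propagates the speaker adaptation to all states at once, with no state-dependent adaptation data required.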