Interspeech 2007 Session ThB.O1: Adaptation in ASR II
Thursday, August 30, 2007
10:00 – 12:00
Mark J. F. Gales (Cambridge University)
Efficient Estimation of Speaker-specific Projecting Feature Transforms
Jonas Lööf, RWTH Aachen University
Ralf Schlüter, RWTH Aachen University
Hermann Ney, RWTH Aachen University
This paper introduces a new, efficient approach for estimating projecting feature transforms for speech recognition. It is based on the MMI' criterion, a likelihood ratio criterion motivated by a simplification of the MMI criterion, and is shown to be closely related to HLDA. In comparison to current methods, the new method is faster, making it more suitable for speaker adaptive training, where the number of speakers, and therefore the number of transforms, is substantial. The proposed method was integrated into the RWTH parliamentary speeches transcription system. Experimental results are presented using speaker-specific projecting transforms, both when used in recognition only and when used for speaker adaptive training, showing consistent improvements. Furthermore, the observed improvements are shown to be additive to the improvements obtained from MLLR. Comparisons to discriminative linear transforms (DLT) are presented, and results are presented for a new projecting DLT method.
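For orientation, a projecting feature transform is simply a rectangular matrix applied to each feature vector. The sketch below estimates such a transform with a generic LDA-style criterion, not the paper's MMI'-based estimator; the two synthetic "phone classes", dimensions, and data are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 300, 3, 1  # frames per class, input dim, projected dim (invented)

# Two synthetic "phone classes" whose means differ only in dims 1 and 2
X1 = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(n, d))
X2 = rng.normal(loc=[0.0, 2.0, 2.0], scale=1.0, size=(n, d))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1.T) + np.cov(X2.T)          # within-class scatter
Sb = np.outer(m1 - m2, m1 - m2)           # between-class scatter

# Projecting transform = leading eigenvectors of inv(Sw) @ Sb (LDA-style)
evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(evals.real)[::-1]
W = evecs[:, order[:p]].real.T            # p x d projecting transform

def fisher_ratio(a, b):
    # class separability of one-dimensional projected features
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var())

proj = fisher_ratio(X1 @ W.T, X2 @ W.T)   # separability after projection
raw = fisher_ratio(X1[:, 0], X2[:, 0])    # separability of raw dim 0 alone
```

The point of the example is only that a p x d matrix both reduces dimensionality and can be chosen to preserve class separability; the paper's contribution is a faster criterion for choosing it per speaker.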
Regularized Feature-Based Maximum Likelihood Linear Regression for Speech Recognition
Mohamed Omar, IBM T. J. Watson Research Center
This paper investigates a possible generalization of feature-based maximum likelihood linear regression (FMLLR) which addresses the degradation in the performance of ASR systems due to small perturbations of the training and the testing data. We formulate the problem as a regularized maximum likelihood linear regression problem. Based on this formulation, we describe a computationally efficient algorithm for estimating the linear regression parameters which maximize the sum of the log likelihood and the negative of a measure of the sensitivity of the estimated likelihood to these perturbations. This approach does not make any assumptions about the noise model during training and testing. We present several large vocabulary speech recognition experiments that show significant recognition accuracy improvement compared to using the speaker-adapted baseline models.
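The regularized-transform idea can be roughly illustrated as follows: maximize the log likelihood of affinely transformed features plus a penalty discouraging deviation from the identity transform. This is a toy caricature, not the paper's algorithm; the single Gaussian model, the Frobenius-norm penalty standing in for the sensitivity measure, the gradient-ascent optimizer, and all constants are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 200
mu = np.zeros(d)                                  # model mean (unit covariance assumed)
X = rng.normal(loc=1.0, scale=1.5, size=(n, d))   # mismatched "test" features
lam = 0.1                                         # regularization weight (invented)
I = np.eye(d)

def objective(A, b):
    # per-frame log likelihood of y = A x + b under N(mu, I), plus the
    # Jacobian term log|det A|, minus a penalty on deviation from identity
    Y = X @ A.T + b
    avg_ll = -0.5 * np.mean(np.sum((Y - mu) ** 2, axis=1))
    return avg_ll + np.log(abs(np.linalg.det(A))) - lam * np.sum((A - I) ** 2)

A, b = I.copy(), np.zeros(d)
base = objective(A, b)
for _ in range(500):                              # plain gradient ascent
    Y = X @ A.T + b
    gA = -(Y - mu).T @ X / n + np.linalg.inv(A).T - 2 * lam * (A - I)
    gb = -(Y - mu).mean(axis=0)
    A += 0.05 * gA
    b += 0.05 * gb
```

The regularizer pulls the estimate toward the identity, which is the sense in which regularization limits sensitivity to small perturbations of the data; the paper's actual sensitivity measure and its efficient estimation algorithm differ.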
Modelling Confusion Matrices to Improve Speech Recognition Accuracy, with an Application to Dysarthric Speech
Santiago Caballero Morales, University of East Anglia
Stephen Cox, University of East Anglia
Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited vocabulary decrease speech recognition accuracy. In this paper, we introduce a technique that can increase recognition accuracy in speakers with low intelligibility by incorporating information from an estimate of the speaker's phoneme confusion matrix. The technique performs much better than standard speaker adaptation when the number of sentences available from a speaker for confusion matrix estimation or adaptation is low, and has similar performance for larger numbers of sentences.
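The core idea, rescoring recognizer output with an estimated confusion matrix, can be caricatured in a few lines. The phone inventory, matrix values, and lexicon below are invented, and real systems score probabilistic alignments rather than fixed-length phone strings.

```python
import numpy as np

# Hypothetical 3-phoneme inventory and an estimated confusion matrix:
# C[spoken][heard] = P(recognizer outputs `heard` | speaker said `spoken`)
C = {
    "p": {"p": 0.5, "b": 0.4, "t": 0.1},
    "b": {"p": 0.2, "b": 0.7, "t": 0.1},
    "t": {"p": 0.1, "b": 0.1, "t": 0.8},
}

# Hypothetical lexicon: word -> intended phoneme sequence
lexicon = {"pat": ["p", "t"], "bat": ["b", "t"]}

def word_score(word, decoded):
    # P(decoded phone sequence | word), assuming independent per-phone confusions
    return float(np.prod([C[s][h] for s, h in zip(lexicon[word], decoded)]))

decoded = ["p", "t"]  # what the recognizer actually output
best = max(lexicon, key=lambda w: word_score(w, decoded))
```

Because the matrix encodes which confusions this speaker's productions actually cause, it can recover the intended word even when the raw decoding is unreliable, and, as the abstract notes, it can be estimated from relatively few sentences.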
An Active Approach to Speaker and Task Adaptation based on Automatic Analysis of Vocabulary Confusability
Qiang Huo, The University of Hong Kong
Wei Li, The University of Hong Kong
Speaker and task adaptation can be made more efficient if an automatic speech recognition system can actively elicit particularly useful adaptation data from a new speaker for a given speech recognition task. This paper presents such an active approach based on an automatic analysis of how difficult the given task vocabulary is. Comparative experiments are designed and conducted for a simple application scenario of searching for an item in a long list via voice. The experimental results demonstrate that the proposed active adaptation strategy performs much better than traditional passive adaptation strategies.
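One crude proxy for vocabulary confusability is pairwise string distance between items' spellings or pronunciations; an active strategy could then prioritize eliciting adaptation data for the most confusable pairs. The item list and the choice of plain edit distance below are invented for illustration and are not the paper's analysis.

```python
from itertools import combinations

def edit_distance(a, b):
    # standard Levenshtein distance via dynamic programming (rolling row)
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

# Hypothetical list items (e.g. names to be searched by voice)
items = ["austin", "boston", "houston", "seattle"]

# Rank pairs by similarity: the closest pairs are the most confusable
pairs = sorted(combinations(items, 2), key=lambda p: edit_distance(*p))
most_confusable = pairs[0]
```

An active adaptation loop would then ask the new speaker to say the flagged items first, spending the limited adaptation budget where recognition errors are most likely.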
fMPE-MAP: Improved Discriminative Adaptation for Modeling New Domains
Jing Zheng, SRI International
Andreas Stolcke, SRI International
This paper introduces a new adaptation approach, fMPE-MAP, which is an extension to the original fMPE (feature minimum phone error) algorithm, with an enhanced ability to port Gaussian models and fMPE transforms to a new domain. We applied this approach to the SRI-ICSI 2007 NIST meeting recognition system, for which we ported our conversational telephone speech (CTS) and broadcast news (BN) models to the meeting domain. Experiments showed that the proposed fMPE-MAP approach achieves comparable or better performance than simply training the fMPE transform on combined data, in addition to the obvious speed advantage. In combination with MPE-MAP, we obtained about 20% relative word error rate reduction on a lecture meeting evaluation test set, over the models trained with the standard MAP approach.
Discriminative MCE-Based Speaker Adaptation of Acoustic Models for a Spoken Lecture Processing Task
Timothy J. Hazen, Massachusetts Institute of Technology
Erik McDermott, NTT Corporation
This paper investigates the use of minimum classification error (MCE) training in conjunction with speaker adaptation for the large vocabulary speech recognition task of lecture transcription. Emphasis is placed on the case of supervised adaptation, though an examination of the unsupervised case is also conducted. This work builds upon our previous work using MCE training to construct speaker independent acoustic models. In this work we explore strategies for incorporating MCE training into a model interpolation adaptation scheme in the spirit of traditional maximum a posteriori probability (MAP) adaptation. Experiments show relative error rate reductions between 3% and 7% over a baseline system which uses standard ML estimation instead of MCE training during the adaptation phase.
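The "spirit of traditional MAP adaptation" can be pictured with the classic MAP mean update, in which each adapted Gaussian mean interpolates between the prior (speaker-independent) mean and the speaker's data, weighted by the occupancy count. The counts, prior weight tau, and dimensions below are invented, and the paper's scheme additionally involves MCE-trained models rather than this plain ML/MAP form.

```python
import numpy as np

tau = 10.0                      # prior weight (invented)
mu_si = np.array([0.0, 0.0])    # speaker-independent (prior) mean

rng = np.random.default_rng(0)
# adaptation frames assigned to this Gaussian (invented data)
frames = rng.normal(loc=[1.0, -1.0], scale=0.5, size=(50, 2))
gamma = len(frames)             # occupancy count for this Gaussian

# MAP mean update: interpolation between the prior mean and the data mean
mu_map = (tau * mu_si + frames.sum(axis=0)) / (tau + gamma)
```

With little data (gamma small) the estimate stays near the speaker-independent mean; with ample data it approaches the speaker's own statistics, which is the interpolation behavior the abstract's adaptation scheme builds on.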