Interspeech 2007 Session TuC.O1: Discriminative and large margin techniques in acoustic modeling
Tuesday, August 28, 2007
13:30 – 15:30
Erik McDermott (NTT Corporation)
Soft Margin Feature Extraction for Automatic Speech Recognition
Jinyu Li, Georgia Institute of Technology
Chin-Hui Lee, Georgia Institute of Technology
We propose a new discriminative learning framework, called soft margin feature extraction (SMFE), for jointly optimizing the parameters of the transformation matrix used for feature extraction and of the hidden Markov models (HMMs) used for acoustic modeling. SMFE extends our previous work on soft margin estimation (SME) to feature extraction. Tested on the TIDIGITS connected digit recognition task, the proposed approach achieves a string accuracy of 99.61%, much better than our previously reported SME results. To our knowledge, this is the first study to apply a margin-based method to the joint optimization of feature extraction and acoustic modeling. The excellent performance of SMFE demonstrates the success of the soft-margin-based method, which aims to achieve both high accuracy and good model generalization.
A Fast Optimization Method for Large Margin Estimation of HMMs based on Second Order Cone Programming
Yan Yin, Department of Computer Science and Engineering, York University
Hui Jiang, Department of Computer Science and Engineering, York University
In this paper, we present a new fast optimization method, based on second order cone programming (SOCP), for large margin estimation (LME) of continuous density hidden Markov models (CDHMMs) for speech recognition. SOCP is a class of nonlinear convex optimization problems that can be solved quite efficiently. We propose a new convex relaxation condition under which LME of CDHMMs can be formulated as an SOCP problem. The new LME/SOCP method has been evaluated on a connected digit string recognition task using the TIDIGITS database. Experimental results clearly demonstrate that LME using SOCP outperforms the previous gradient descent method and achieves performance comparable to that of our previously proposed semidefinite programming (SDP) approach. However, the SOCP method is much more efficient than the SDP method in terms of optimization time (about 20-200 times faster) and memory usage.
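For orientation, the generic standard form of an SOCP is shown below. This is the textbook problem class only, not the paper's LME relaxation; the symbols $x$, $f$, $A_i$, $b_i$, $c_i$, $d_i$, $F$ and $g$ are generic placeholders and do not come from the abstract.

```latex
\begin{aligned}
\min_{x} \quad & f^{\top} x \\
\text{subject to} \quad & \lVert A_i x + b_i \rVert_2 \le c_i^{\top} x + d_i, \qquad i = 1, \dots, m, \\
& F x = g .
\end{aligned}
```

Each inequality constrains an affine function of $x$ to lie in a second order cone. This keeps the problem convex even though the constraints are nonlinear.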
Frame margin probability discriminative training algorithm for noisy speech recognition
Hao-Zheng Li, INRS-EMT University of Quebec
Douglas O'Shaughnessy, INRS-EMT University of Quebec
This paper presents a novel discriminative training technique for noisy speech recognition. First, we define a Frame Margin Probability (FMP), which denotes the difference between the score of a frame under its correct model and its score under the competing model. Frames with negative FMP values are regarded as confusable frames, and frames with positive FMP values are regarded as discriminable frames. Second, the confusable frames are emphasized and the overly discriminable frames are down-weighted by an empirical weighting function. The acoustic model parameters are then tuned using the weighted frames. With this kind of weighting, the confusable frames, which are often noisy, contribute more to the acoustic model than they would without weighting. We evaluate this technique using the Aurora standard database (TIDIGITS) and HTK 3.3, and obtain a 15.9% WER reduction for noisy speech recognition and a 13.13% WER reduction for clean speech recognition compared with the MLE baseline systems.
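The weighting idea can be illustrated with a minimal sketch. The abstract does not specify the empirical weighting function, so the logistic shape used here and the names `frame_margins`, `frame_weight`, `slope` and `offset` are assumptions made purely for illustration. Scores are taken to be per-frame log-likelihoods.

```python
import math


def frame_margins(correct_scores, competing_scores):
    """Per-frame margin: score under the correct model minus score under the competitor."""
    if len(correct_scores) != len(competing_scores):
        raise ValueError("score lists must have the same length")
    return [c - k for c, k in zip(correct_scores, competing_scores)]


def frame_weight(margin, slope=1.0, offset=0.0):
    """Hypothetical weighting function (an assumption, not the paper's).

    Returns 1 / (1 + exp(slope * (margin - offset))): close to 1 for
    confusable frames (negative margin) and close to 0 for overly
    discriminable frames (large positive margin).
    """
    z = slope * (margin - offset)
    if z >= 0:
        # Rewritten form avoids overflow in exp() for large positive z.
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def weight_frames(correct_scores, competing_scores, slope=1.0, offset=0.0):
    """Return (margin, weight) pairs, one per frame."""
    margins = frame_margins(correct_scores, competing_scores)
    return [(m, frame_weight(m, slope, offset)) for m in margins]
```

In a training loop, each frame's contribution to the parameter update would be scaled by its weight. How the weights enter the update is not described in the abstract and is left out of this sketch.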
Hierarchical Neural Networks Feature Extraction for LVCSR System
Fabio Valente, IDIAP Research Institute
Jithendra Vepa, IDIAP Research Institute
Christian Plahl, RWTH Aachen University
Christian Gollan, RWTH Aachen University
Hynek Hermansky, IDIAP Research Institute
Ralf Schluter, RWTH Aachen University
This paper investigates the use of a hierarchy of neural networks (NNs) for data-driven feature extraction. Two different hierarchical structures, based on long and short temporal context, are considered. Features are tested on two different LVCSR systems, one for Meetings data (RT05 evaluation data) and one for Arabic Broadcast News (BNAT05 evaluation data). The hierarchical NN features consistently outperform the single NN features across different types of data and tasks and provide significant improvements with respect to the respective baseline systems. The best result is obtained when different time resolutions are used at different levels of the hierarchy.
Bhattacharyya Error and Divergence using Variational Importance Sampling
Peder Olsen, IBM
John Hershey, IBM
Many applications require the use of divergence measures between probability distributions. Several of these, such as the Kullback-Leibler (KL) divergence and the Bhattacharyya divergence, are tractable for single Gaussians but intractable for complex distributions such as the Gaussian mixture models (GMMs) used in speech recognizers. For tasks related to classification error, the Bhattacharyya divergence is of special importance. Here we derive efficient approximations to the Bhattacharyya divergence for GMMs, using novel variational methods and importance sampling. We introduce a combination of the two, variational importance sampling (VISa), which performs importance sampling using a proposal distribution derived from the variational approximation. VISa achieves the same accuracy as naive importance sampling at a fraction of the computation. Finally, we apply the Bhattacharyya divergence to compute word confusability and compare the results with the corresponding estimates obtained using the KL divergence.
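The contrast between the tractable single-Gaussian case and the GMM case can be sketched as follows. This is plain importance sampling for one-dimensional distributions, not the VISa method from the paper. The proposal (an equal mixture of the two GMMs), the representation of a GMM as a list of `(weight, mean, std)` triples, and all function names are assumptions made for illustration.

```python
import math
import random


def gauss_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gmm_pdf(x, gmm):
    """Density of a GMM given as (weight, mean, std) triples; weights sum to 1."""
    return sum(w * gauss_pdf(x, m, s) for w, m, s in gmm)


def gmm_sample(gmm, rng):
    """Draw one sample: pick a component by weight, then sample from it."""
    weights = [w for w, _, _ in gmm]
    _, mean, std = rng.choices(gmm, weights=weights, k=1)[0]
    return rng.gauss(mean, std)


def bhattacharyya_gauss(m1, s1, m2, s2):
    """Closed-form Bhattacharyya divergence between two univariate Gaussians."""
    v = s1 * s1 + s2 * s2
    return 0.25 * (m1 - m2) ** 2 / v + 0.5 * math.log(v / (2.0 * s1 * s2))


def bhattacharyya_gmm_is(f, g, n_samples=20000, seed=0):
    """Monte Carlo estimate of -log(integral of sqrt(f * g)) for two GMMs.

    Samples come from the proposal q = (f + g) / 2, and each sample is
    weighted by sqrt(f(x) * g(x)) / q(x).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        source = f if rng.random() < 0.5 else g
        x = gmm_sample(source, rng)
        fx, gx = gmm_pdf(x, f), gmm_pdf(x, g)
        total += math.sqrt(fx * gx) / (0.5 * (fx + gx))
    return -math.log(total / n_samples)
```

With single-component GMMs the sampled estimate can be checked against the closed form, for example `bhattacharyya_gmm_is([(1.0, 0.0, 1.0)], [(1.0, 1.0, 1.0)])` against `bhattacharyya_gauss(0.0, 1.0, 1.0, 1.0)`. For multi-component GMMs only the sampled estimate is available.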
Phoneme Dependent Frame Selection Preference
Tingyao Wu, Dept. ESAT, Katholieke Universiteit Leuven, Belgium
Jacques Duchateau, Dept. ESAT, Katholieke Universiteit Leuven, Belgium
Dirk Van Compernolle, Dept. ESAT, Katholieke Universiteit Leuven, Belgium
In a previous study, we proposed algorithms to select representative frames from a segment for phoneme likelihood evaluation. In this paper, we show that this frame selection behavior is phoneme dependent. We observe that some phonemes benefit from frame selection while others do not, and that this separation matches the phonetic categories. For the phonemes that are sensitive to frame selection, we find that selecting frames at certain pre-defined positions in the segment enhances the discrimination between phonemes. These phoneme-dependent positions are explicitly retrieved and used in a phoneme classification task. Experimental results on the TIMIT phonetic database show that the frame selection method significantly outperforms decoding with the classical Viterbi decoder.