Extended Powered Cepstral Normalization (P-CN) with Range Equalization for Robust Features in Speech Recognition
Chang-wen Hsu, National Taiwan University, Taiwan, Republic of China
Lin-shan Lee, National Taiwan University, Taiwan, Republic of China
Cepstral normalization has been widely used as a powerful approach to produce robust features for speech recognition. Powered Cepstral Normalization (P-CN) was recently proposed to normalize the MFCC parameters in the r1-th-order powered domain, where r1 > 1.0, and then transform the features back by a 1/r2 power to a domain better suited for recognition; it was shown to produce robust features. Here we extend P-CN to a more effective and efficient form, in which good values of r2 are found on-line for each utterance in real time based on the concept of dynamic range equalization. The basic idea is that the difference in the dynamic ranges of feature parameters is a good indicator of the mismatch that degrades recognition performance. Extensive experimental results showed that the Extended P-CN with range equalization proposed in this paper significantly outperforms conventional Cepstral Normalization and P-CN in all noisy conditions.
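As an illustration, the powered-domain normalization the abstract describes can be sketched as follows. This is a minimal sketch, not the authors' implementation: the sign-preserving power (to handle negative cepstral coefficients) and the per-utterance mean/variance statistics are our assumptions.

```python
import numpy as np

def powered_cepstral_normalization(cepstra, r1=2.0, r2=2.0):
    """Sketch of Powered Cepstral Normalization (P-CN): map each cepstral
    dimension to the r1-th-order powered domain, mean/variance-normalize
    there, then map back with a 1/r2 power."""
    def signed_power(x, p):
        # sign-preserving power, an assumption to cope with negative MFCCs
        return np.sign(x) * np.abs(x) ** p

    powered = signed_power(cepstra, r1)              # to the powered domain
    mu = powered.mean(axis=0)
    sigma = powered.std(axis=0) + 1e-8
    normalized = (powered - mu) / sigma              # per-utterance CMVN
    return signed_power(normalized, 1.0 / r2)        # back-transform

# toy usage: 100 frames x 13 MFCC-like coefficients
feats = np.random.randn(100, 13) * 3.0 + 1.0
out = powered_cepstral_normalization(feats, r1=2.0, r2=2.0)
```

Re-applying the signed r2-th power to the output recovers the zero-mean normalized features, so the back-transform loses no information.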
Selection of optimal dimensionality reduction method using Chernoff bound for segmental unit input HMM
Makoto Sakai, DENSO CORPORATION
Norihide Kitaoka, Nagoya University
Seiichi Nakagawa, Toyohashi University of Technology
To model the time dependency of features precisely, segmental unit input HMM with a dimensionality reduction method has been widely used for speech recognition. Linear discriminant analysis (LDA) and heteroscedastic discriminant analysis (HDA) are popular approaches to reducing dimensionality. We have proposed another dimensionality reduction method, power linear discriminant analysis (PLDA). Ideally, one would select the dimensionality reduction method that yields the highest recognition performance, but selecting by trial and error requires considerable time to train HMMs and test recognition performance for each method. In this paper we propose a performance comparison method that requires neither training nor testing. We show that the proposed method, based on the Chernoff bound, can rapidly and accurately evaluate relative recognition performance.
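The Chernoff bound the abstract refers to can be computed in closed form for a pair of Gaussian class-conditional densities; the sketch below (the two-class Gaussian case only, with our own parameter names) shows the separability measure that makes such a comparison possible without retraining.

```python
import numpy as np

def chernoff_distance(mu1, cov1, mu2, cov2, s=0.5):
    """Chernoff distance between two Gaussians N(mu1, cov1), N(mu2, cov2);
    s = 0.5 gives the Bhattacharyya distance. A larger value means a
    tighter upper bound on Bayes error, i.e. better class separability
    in the reduced feature space."""
    diff = mu2 - mu1
    cov_s = s * cov1 + (1 - s) * cov2
    # Mahalanobis-like term between the means
    term1 = 0.5 * s * (1 - s) * diff @ np.linalg.solve(cov_s, diff)
    # covariance-mismatch term
    term2 = 0.5 * np.log(
        np.linalg.det(cov_s)
        / (np.linalg.det(cov1) ** s * np.linalg.det(cov2) ** (1 - s))
    )
    return term1 + term2

# Two well-separated 2-D classes score higher than two overlapping ones.
I = np.eye(2)
far = chernoff_distance(np.zeros(2), I, np.array([4.0, 0.0]), I)
near = chernoff_distance(np.zeros(2), I, np.array([0.5, 0.0]), I)
```

Ranking candidate projections (LDA, HDA, PLDA) by such a distance, rather than by trained-HMM word error rate, is the shortcut the paper evaluates.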
Fepstrum: An improved modulation spectrum for ASR
Vivek Tyagi, IBM India Research Laboratory
In our previous work, we introduced the fepstrum, an improved modulation spectrum estimation technique that overcomes certain theoretical and practical shortcomings of previously published modulation spectrum techniques. In this paper, fepstrum performance is rigorously benchmarked on the TIMIT phoneme recognition task against a competitive MFCC baseline, the other best results reported on the same task, and a heterogeneous, multiple-classifier-based technique. In our experiments, a composite feature formed by simple concatenation of fepstrum and MFCC features is used to train a conventional hidden Markov model / Gaussian mixture model (HMM-GMM) recognizer. This composite feature achieves a phoneme recognition accuracy of 74.6% on the TIMIT core test set, which is 1.8% absolute better than the 72.8% accuracy of the MFCC HMM-GMM recognizer.
Narrowband to Wideband Feature Expansion for Robust Multilingual ASR
Dušan Macho, Center for Human Interaction Research, Motorola Labs, Schaumburg, USA
To build high-quality wideband acoustic models for automatic speech recognition (ASR), a large amount of wideband speech training data is required. However, for a particular language, one may have a lot of narrowband data available but only a limited amount of wideband data. This paper addresses such a situation and proposes a narrowband-to-wideband expansion algorithm that expands narrowband-signal ASR features to wideband ASR features. The algorithm is tested in two practical situations, with a sufficient and an insufficient amount of original wideband training data. Tests show that using a combination of wideband features and expanded features does not harm ASR performance when a sufficient amount of original wideband data is available, and improves ASR performance significantly when only a limited amount of wideband data is originally available. In the presented multilingual tests, a single expansion model is trained for four languages from the Speecon...
Non-linear Spectral Contrast Stretching for In-car Speech Recognition
Weifeng Li, IDIAP Research Institute
Herve Bourlard, IDIAP Research Institute
In this paper, we present a novel feature normalization method in the log-scaled spectral domain for improving the noise robustness of speech recognition front-ends. In the proposed scheme, non-linear contrast stretching is applied to the outputs of the log mel-filterbanks (MFB) to imitate the adaptation of the auditory system under adverse conditions. This is followed by a two-dimensional filter to smooth out processing artifacts. The proposed MFCC front-ends perform remarkably well on the CENSREC-2 in-car database, with an average relative improvement of 29.3% compared to the baseline MFCC system. It is also confirmed that the proposed processing in the log MFB domain can be integrated with conventional cepstral post-processing techniques to yield further improvements. The proposed algorithm is simple and requires only a small additional computational load.
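The general idea of contrast stretching in the log MFB domain can be sketched as below. The specific non-linearity here (a tanh expansion around each channel's mean, with gain `alpha`) is a hypothetical stand-in; the paper's exact stretching function and the follow-up 2-D smoothing filter are not reproduced.

```python
import numpy as np

def contrast_stretch(log_mfb, alpha=1.5):
    """Hypothetical non-linear contrast stretching of log mel-filterbank
    outputs: deviations from each channel's mean are expanded (slope
    alpha > 1 near the mean) while extremes saturate, increasing
    spectral contrast between speech peaks and noise-filled valleys."""
    mu = log_mfb.mean(axis=0, keepdims=True)        # per-channel mean
    return mu + alpha * np.tanh(log_mfb - mu)       # expand, then saturate

# toy usage: 50 frames x 24 mel channels of log-energies
log_mfb = np.random.randn(50, 24) + 10.0
stretched = contrast_stretch(log_mfb)
```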
Clustering-based Two-Dimensional Linear Discriminant Analysis for Speech Recognition
Xiao-Bing Li, INRS-Energy, Materials and Telecommunications, Montreal, Canada
Douglas O'Shaughnessy, INRS-Energy, Materials and Telecommunications, Montreal, Canada
In this paper, a new Clustering-based Two-Dimensional Linear Discriminant Analysis (Clustering-based 2DLDA) method is proposed for extracting discriminant features in Automatic Speech Recognition (ASR). It builds on Two-Dimensional Linear Discriminant Analysis (2DLDA), which works with data represented in matrix space and is adopted to extract discriminant information in a joint spectral-temporal domain. Clustering-based 2DLDA integrates the cluster information within each class by redefining the between-class scatter matrix, to address the fact that many clusters exist in each state of Hidden Markov Model (HMM)-based ASR. The method was evaluated on the TIDigits connected-digit string recognition and TIMIT continuous phoneme recognition tasks. Experimental results show that 2DLDA yields a slight improvement in recognition performance over classical LDA, and our proposed Clustering-based 2DLDA outperforms 2DLDA.
A Study on Temporal Features Derived by Analytic Signal
Yotaro Kubo, Waseda University
Shigeki Okawa, Chiba Institute of Technology
Akira Kurematsu, Waseda University
Katsuhiko Shirai, Waseda University
Traditional feature extraction methods for ASR, such as MFCC and PLP, are derived from short-term spectral envelopes and can be used to build promising ASR systems. On the other hand, features extracted by TRAPs-like classifiers are based on long-term envelopes of narrow-band signals. Both forms of feature extraction rely on representations of the energy in narrow-band signals. We have developed a feature extraction system that depends not only on the energy but also on the modulation of the carrier signals. Experiments show that not only the spectral envelope and its modulation but also zero-crossing points and frequency modulation contribute significantly to human speech perception. In this study we propose a method of carrier analysis, evaluate it, and discuss the effectiveness of carrier analysis for ASR. Our method reduces the phoneme error rate from 45.7% to 38.6%.
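The envelope/carrier split that underlies this kind of analysis can be obtained from the analytic signal. The sketch below computes it via the FFT (the standard construction, equivalent to scipy.signal.hilbert); it is a generic illustration of the decomposition, not the paper's feature pipeline.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT: zero the negative frequencies and
    double the positive ones. Its magnitude is the envelope (what
    energy-based features use); its phase carries the carrier/frequency-
    modulation information the paper argues is also perceptually relevant."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    z = np.fft.ifft(X * h)
    return np.abs(z), np.unwrap(np.angle(z))   # envelope, instantaneous phase

# a pure 50 Hz cosine sampled over exactly 50 cycles
t = np.arange(1024) / 1024.0
env, phase = analytic_signal(np.cos(2 * np.pi * 50 * t))
```

For a pure tone the envelope is flat and the unwrapped phase increases linearly, so the instantaneous frequency (phase slope over 2π) recovers the carrier frequency.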
Dimensionality Reduction of Speech Features using Nonlinear Principal Components Analysis
Stephen Zahorian, Department of Electrical and Computer Engineering, Binghamton University
Tara Singh, Department of Electrical and Computer Engineering, Old Dominion University
Hongbing Hu, Department of Electrical and Computer Engineering, Binghamton University
One of the main practical difficulties for automatic speech recognition is the large dimensionality of acoustic feature spaces and the attendant training problems collectively referred to as the “curse of dimensionality.” Many linear techniques, most notably principal components analysis (PCA) and linear discriminant analysis (LDA) and several variants, have been used to reduce dimensionality while attempting to preserve the variability and discriminability of classes in the feature space. However, these orthogonal rotations of the feature space are suboptimal if the data lie primarily on curved subspaces embedded in the higher-dimensional feature space. In this paper, two neural-network-based nonlinear transformations are used to represent speech data in reduced-dimensionality subspaces. It is shown that a subspace computed with the explicit intent of maximizing classification accuracy is far superior to a subspace derived so as to minimize mean-square representation error.
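For reference, the linear PCA baseline that such nonlinear methods are compared against is just an orthogonal rotation followed by truncation; the sketch below (a generic implementation, not the authors' code) makes concrete what "minimize mean-square representation error" means.

```python
import numpy as np

def pca_project(X, k):
    """Project feature vectors onto the top-k principal directions.
    This is exactly the kind of orthogonal rotation the paper argues is
    suboptimal when the data lie on curved subspaces."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)          # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]   # keep k largest-variance axes
    return Xc @ eigvecs[:, order]

# toy usage: 10-D features with decreasing per-axis scale, reduced to 3-D
X = np.random.randn(200, 10) @ np.diag(np.linspace(3.0, 0.5, 10))
Y = pca_project(X, 3)
```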
Linear Transformation Approach to VTLN Using Dynamic Frequency Warping
Rama Sanand Doddipatla, Indian Institute of Technology, Kanpur
Dinesh Kumar Devineni, Indian Institute of Technology, Kanpur
Srinivasan Umesh, Indian Institute of Technology, Kanpur
In this paper, we present a novel linear transformation approach for frequency warping during vocal tract length normalisation (VTLN) using the idea of dynamic frequency warping (DFW). A linear transformation among the mel-frequency cepstral features provides the computational advantage of not having to recompute features for each warp factor in VTLN. The proposed method separates the smoothing and frequency warping operations in the feature extraction stage, unlike the conventional approach where both operations are integrated into the filter-bank operation. The advantage of the proposed DFW approach is that we can obtain a transformation matrix for any arbitrary warping, even when the functional form or mapping of the warping function is unknown. We compare the performance of the proposed method with recently proposed approaches on one phone classification and two digit recognition tasks.
Features Interpolation Domain for Distributed Speech Recognition and Performance for ITU-T G.723.1 Codec
Vladimir Fabregas Surigué de Alencar, Center for Telecommunications Studies (CETUC), Pontifical Catholic University of Rio de Janeiro (PUC-RIO), Rio de Janeiro, Brazil
Abraham Alcaim, Center for Telecommunications Studies (CETUC), Pontifical Catholic University of Rio de Janeiro (PUC-RIO), Rio de Janeiro, Brazil
In this paper, we examine the best domain in which to perform feature interpolation in Distributed Speech Recognition (DSR) systems. We show that the only domain in which a performance gain can be achieved from the interpolation procedure is the Line Spectral Frequencies (LSF) domain. A DSR scenario in which the ITU-T G.723.1 codec is employed is also investigated. The recognition features generated from the reconstructed speech are highly sensitive to the encoding noise. We also show that the LSF quantization scheme used by the G.723.1 codec decreases recognition performance by approximately 2%.
Dynamic Integration of Multiple Feature Streams for Robust Real-Time LVCSR
Shoei Sato, NHK (Japan Broadcasting Corporation) & Waseda University
Kazuo Onoe, NHK (Japan Broadcasting Corporation)
Akio Kobayashi, NHK (Japan Broadcasting Corporation)
Shinich Homma, NHK (Japan Broadcasting Corporation)
Toru Imai, NHK (Japan Broadcasting Corporation)
Tohru Takagi, NHK (Japan Broadcasting Corporation)
Tetsunori Kobayashi, Waseda University
We present a novel method of integrating the likelihoods of multiple feature streams for robust speech recognition. The integration algorithm dynamically calculates a frame-wise stream weight so that a heavier weight is given to a stream that is robust to a variety of noisy environments or speaking styles; such a robust stream is expected to retain discriminative ability. The weight is calculated in real time from the mutual information between an input stream and the active HMM states in the search space. In this paper, we describe three features that are extracted through auditory filters, taking into account how the human auditory system extracts amplitude and frequency modulations. These features are expected to provide complementary cues for speech recognition. Speech recognition experiments using field reports and spontaneous commentary from Japanese broadcast news showed that the proposed method reduced word errors by 9% relative to the best result obtained from a single stream.
PCA-Based Feature Extraction for Fluctuation in Speaking Style of Articulation
Hironori Matsumasa, Kobe University
Tetsuya Takiguchi, Kobe University
Yasuo Ariki, Kobe University
Ichao Li, Otemon Gakuin University
Toshitaka Nakabayashi, Kobe University
We investigated speech recognition for a person with articulation disorders resulting from athetoid cerebral palsy. Recently, the accuracy of speaker-independent speech recognition has been remarkably improved by the use of stochastic modeling of speech. However, those acoustic models cause degradation of speech recognition for a person with a different speech style (e.g., articulation disorders). In this paper, we discuss our efforts to build an acoustic model for a person with articulation disorders. The articulation of the initial part of an utterance tends to become unstable due to strain on the speech muscles, and that causes degradation of speech recognition. Therefore, we propose a robust feature extraction method based on PCA (Principal Component Analysis) instead of MFCC. Its effectiveness is confirmed by word recognition experiments.
Multi-stream Features Combination based on Dempster-Shafer Rule for LVCSR System
Fabio Valente, IDIAP Research Institute
Jithendra Vepa, IDIAP Research Institute
Hynek Hermansky, IDIAP Research Institute
This paper investigates the combination of two streams of acoustic features. Extending our previous work on a small-vocabulary task, we show that combination based on the Dempster-Shafer rule outperforms several classical rules, such as sum, product, and inverse entropy weighting, in LVCSR systems as well. We analyze results in terms of Frame Error Rate and Cross Entropy measures. The experimental framework uses a meeting transcription task, and results are reported on the RT05 evaluation data. The results are consistent with what has been previously observed on smaller databases.
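Dempster's rule of combination for two frame-level posterior streams can be sketched as follows. This is a simplified illustration, not the paper's setup: we treat the posteriors as basic probability assignments whose focal elements are all singleton classes, in which case the rule reduces to a conflict-normalized product.

```python
import numpy as np

def dempster_combine(m1, m2):
    """Dempster's rule for two basic probability assignments over the
    same classes, assuming all mass sits on singletons: multiply masses
    on agreeing classes, discard the conflicting mass, renormalize."""
    joint = m1 * m2                      # mass where the two streams agree
    conflict = 1.0 - joint.sum()         # mass lost to conflicting pairs
    if conflict >= 1.0:
        raise ValueError("total conflict: combination undefined")
    return joint / (1.0 - conflict)

# hypothetical per-frame posteriors from two feature streams
p_stream1 = np.array([0.7, 0.2, 0.1])
p_stream2 = np.array([0.6, 0.3, 0.1])
combined = dempster_combine(p_stream1, p_stream2)
```

Note how agreement between the streams sharpens the combined posterior on the winning class, which is the behavior that distinguishes this rule from a plain average.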
Dimensionality Reduction for Speech Recognition Using Neighborhood Components Analysis
Natasha Singh-Miller, MIT
Michael Collins, MIT
Timothy J. Hazen, MIT
Previous work has considered methods for learning projections of high-dimensional acoustic representations to lower-dimensional spaces. In this paper we apply the neighborhood components analysis (NCA) method (Goldberger et al., 2005) to acoustic modeling in a speech recognizer. NCA learns a projection of acoustic vectors that optimizes a criterion closely related to the classification accuracy of a nearest-neighbor classifier. We introduce regularization into this method, giving further improvements in performance. We describe experiments on a lecture transcription task, comparing projections learned using NCA and HLDA (Kumar, 1998). Regularized NCA gives a 0.7% absolute reduction in WER over HLDA, which corresponds to a relative reduction of 1.9%.
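The NCA criterion mentioned in the abstract can be written down compactly: under a projection A, each point softly selects neighbors with probability proportional to exp(-distance²), and the objective is the expected leave-one-out accuracy of that stochastic nearest-neighbor rule. The sketch below evaluates the objective only (a minimal illustration; training would ascend its gradient with respect to A).

```python
import numpy as np

def nca_objective(A, X, y):
    """Expected leave-one-out soft-nearest-neighbor accuracy that NCA
    maximizes (Goldberger et al., 2005): p_ij ~ exp(-||Ax_i - Ax_j||^2),
    summed over same-class neighbors j, averaged over points i."""
    Z = X @ A.T                                            # project
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)    # pairwise dists
    np.fill_diagonal(d2, np.inf)                           # no self-neighbor
    p = np.exp(-d2)
    p /= p.sum(axis=1, keepdims=True)                      # soft neighbors
    same = (y[:, None] == y[None, :])
    return (p * same).sum(axis=1).mean()                   # in [0, 1]

# toy usage: two tight, well-separated classes; identity projection
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
y = np.array([0, 0, 1, 1])
score = nca_objective(np.eye(2), X, y)
```

On this toy data the score is close to 1, since every point's nearest neighbor shares its class.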
Probabilistic Latent Speaker Analysis for Large Vocabulary Speech Recognition
Dan Su, PhD candidate, Speech and Hearing Research Center, State Key Laboratory on Machine Perception, Peking University
Xihong Wu, Speech and Hearing Research Center, State Key Laboratory on Machine Perception, Peking University
Huisheng Chi, Professor, Speech and Hearing Research Center, State Key Laboratory on Machine Perception, Peking University
The trajectory folding problem is intrinsic to HMM-based speech recognition systems in which each state is modeled by a mixture of Gaussian components. In this paper, a probabilistic latent semantic analysis (PLSA)-based approach is proposed for use in speech recognition systems to alleviate this problem. The basic idea is that different speech trajectories are strongly correlated with speaker variation, and different speakers may consistently score highly on certain Gaussian components. Thus, PLSA is adopted to perform co-occurrence analysis between Gaussian components and speakers, providing an additional source of information to constrain the search path during decoding. Experimental results show relative reductions in word error rate of 11.2% and 2.7% on a homogeneous test set and the 2004 863 evaluation set, respectively.
MRASTA and PLP in Automatic Speech Recognition
SRM Prasanna, Indian Institute of Technology Guwahati
Hynek Hermansky, EPFL and IDIAP Switzerland
This work explores different methods for combining estimated posterior probabilities from Multi-RASTA (MRASTA) and Perceptual Linear Prediction (PLP) features for Automatic Speech Recognition (ASR). The improved performance of the ASR system indicates the complementary nature of the information present in MRASTA and PLP. Among the combination methods explored, the product rule gives the best performance.