August 27-31, 2007

Antwerp, Belgium

Interspeech 2007 Session WeC.P2: Robust ASR II

Type poster
Date Wednesday, August 29, 2007
Time 13:30 – 15:30
Room Alpaerts
Chair Richard Rose (McGill University, Dept. of ECE)


Irrelevant Variability Normalization Based HMM Training Using VTS Approximation of an Explicit Model of Environmental Distortions
Yu Hu, The University of Hong Kong
Qiang Huo, The University of Hong Kong

In a traditional HMM compensation approach to robust speech recognition that uses a Vector Taylor Series (VTS) approximation of an explicit model of environmental distortions, the set of generic HMMs is typically trained from "clean" speech only. In this paper, we present a maximum likelihood approach to training generic HMMs from both "clean" and "corrupted" speech, based on the concept of irrelevant variability normalization. Evaluation results on the Aurora2 connected-digits database demonstrate that the proposed approach achieves significant improvements in recognition accuracy over the traditional VTS-based HMM compensation approach.
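As a rough, self-contained illustration of the kind of compensation this line of work builds on, the sketch below applies first-order VTS to a single diagonal Gaussian in the log-mel domain, using the simplified log-spectral distortion model y = x + h + log(1 + exp(n - x - h)) (no cepstral DCT; all names are illustrative, not taken from the paper):

```python
import numpy as np

def vts_compensate(mu_x, var_x, mu_n, var_n, h):
    """First-order VTS compensation of a clean Gaussian (mu_x, var_x)
    in the log-mel domain, given a noise Gaussian (mu_n, var_n) and a
    channel offset h: y = x + h + log(1 + exp(n - x - h))."""
    g = np.log1p(np.exp(mu_n - mu_x - h))         # mismatch term at the expansion point
    mu_y = mu_x + h + g                           # compensated mean
    G = 1.0 / (1.0 + np.exp(mu_n - mu_x - h))     # dy/dx at the expansion point
    var_y = G**2 * var_x + (1.0 - G)**2 * var_n   # diagonal variance propagation
    return mu_y, var_y
```

In the limits, the compensated Gaussian reduces to the clean model (noise far below speech) or to the noise model (noise dominating), which is a quick sanity check on the linearization.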

On the Jointly Unsupervised Feature Vector Normalization and Acoustic Model Compensation for Robust Speech Recognition
Luis Buera, University of Zaragoza, Spain
Antonio Miguel, University of Zaragoza, Spain
Eduardo Lleida, University of Zaragoza, Spain
Oscar Saz, University of Zaragoza, Spain
Alfonso Ortega, University of Zaragoza, Spain

To compensate for the mismatch between training and testing conditions, an unsupervised hybrid compensation technique is proposed. It combines Multi-Environment Model based LInear Normalization (MEMLIN) with a novel acoustic model adaptation method based on rotation transformations. A set of rotation transformations is estimated between clean and MEMLIN-normalized data by linear regression during training. Each MEMLIN-normalized frame is then decoded using expanded acoustic models, which are obtained from the reference models and the set of rotation transformations. During the search, one of the rotation transformations is selected online for each frame according to the ML criterion in a modified Viterbi algorithm. Experiments were carried out on the Spanish SpeechDat Car database. MEMLIN over standard ETSI front-end parameters achieves a mean WER improvement of 75.53%, while the proposed hybrid solution reaches 90.54%.
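A minimal sketch of the per-frame ML selection step described above, with a single-Gaussian score standing in for the full acoustic model and hand-made 2-D rotation matrices (the setup and names are illustrative only):

```python
import numpy as np

def select_rotation(frame, rotations, mean, cov_inv):
    """Pick, for one frame, the rotation transformation that maximizes the
    Gaussian log-likelihood of the rotated frame -- the role played by the
    per-frame ML selection inside the modified Viterbi search."""
    scores = [-(R @ frame - mean) @ cov_inv @ (R @ frame - mean)
              for R in rotations]                 # negative Mahalanobis distances
    return int(np.argmax(scores))
```

In the real system this choice happens inside the decoder, per frame and per state; the sketch isolates the selection criterion itself.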

An Ensemble Modeling Approach to Joint Characterization of Speaker and Speaking Environments
Yu Tsao, School of Electrical and Computer Engineering, Georgia Institute of Technology
Chin-Hui Lee, School of Electrical and Computer Engineering, Georgia Institute of Technology

We propose an ensemble modeling framework to jointly characterize speaker and speaking environments for robust speech recognition. We represent a particular environment by a super-vector formed by concatenating the entire set of mean vectors of the Gaussian mixture components in its corresponding hidden Markov model set. In the training phase, we generate an ensemble speaker and speaking environment super-vector by concatenating the super-vectors trained on data from many real or simulated environments. In the recognition phase, the ensemble super-vector is converted to the super-vector for the testing environment with an affine transformation estimated online by a maximum likelihood (ML) algorithm. We used a simplified formulation of the proposed approach and evaluated its performance on the Aurora 2 database. In an unsupervised adaptation mode, the proposed approach achieves 7.27% and 13.90% WER reductions over the baseline of a gender-dependent system when tested in clean and averaged noisy conditions (0dB to 20dB), respectively. The results suggest that the proposed approach characterizes environments well in the presence of either single or multiple distortion sources.
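The super-vector construction can be sketched in a few lines; here a scalar least-squares affine fit stands in for the paper's online ML estimation, and all sizes (3 environments, 4 Gaussians of dimension 2) are illustrative:

```python
import numpy as np

# Hypothetical sizes: 3 training environments, an HMM set with 4 Gaussian
# components of dimension 2 each (all numbers are illustrative).
rng = np.random.default_rng(0)
env_means = [rng.normal(size=(4, 2)) for _ in range(3)]

# One super-vector per environment: concatenate all of its mean vectors.
supervecs = [m.reshape(-1) for m in env_means]     # each of length 4 * 2 = 8

# Ensemble super-vector: concatenate the per-environment super-vectors.
ensemble = np.concatenate(supervecs)               # length 3 * 8 = 24

# In recognition, an affine transform maps the ensemble super-vector toward
# the testing environment; a scalar affine y = a*v + b fit by least squares
# stands in here for the ML estimation.
v_test = 2.0 * ensemble + 1.0                      # synthetic target for the sketch
a, b = np.polyfit(ensemble, v_test, 1)
```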

Cluster-based Polynomial-Fit Histogram Equalization (CPHEQ) for Robust Speech Recognition
Shih-Hsiang Lin, National Taiwan Normal University, Taipei, Taiwan
Yao-Ming Yeh, National Taiwan Normal University, Taipei, Taiwan
Berlin Chen, National Taiwan Normal University, Taipei, Taiwan

In this paper, we consider the use of histogram equalization (HEQ) for robust ASR. In contrast to conventional methods, a novel data-fitting method based on polynomial regression is presented to efficiently approximate the inverse of the cumulative density function of the training speech for HEQ. Moreover, a more elaborate attempt to use such polynomial regression models to directly characterize the relationship between the feature domain and the corresponding distribution probabilities, under various noise conditions, is proposed as well. All experiments were carried out on the Aurora-2 database and task. The performance of the presented methods was extensively tested and verified by comparison with other conventional methods. Experimental results show that for clean-condition training, our method achieves a considerable word error rate reduction over the baseline system and also significantly outperforms the other methods.
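The core polynomial-fit idea can be shown in miniature: fit a polynomial to the empirical inverse CDF of training features, then map each test value through its empirical quantile. This is a hedged sketch (degree 7, Gaussian stand-in data, and the single-cluster case are arbitrary choices, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, 5000)   # stand-in for training-speech feature values

# Empirical inverse CDF of the training data: quantile p -> feature value.
p = (np.arange(train.size) + 0.5) / train.size
x_sorted = np.sort(train)

# Polynomial fit of the inverse CDF -- the data-fitting idea in miniature.
coef = np.polyfit(p, x_sorted, deg=7)

def heq(test):
    """Map each test value to a training quantile value via the polynomial."""
    ranks = np.argsort(np.argsort(test))
    q = (ranks + 0.5) / test.size    # empirical CDF of the test utterance
    return np.polyval(coef, q)

equalized = heq(rng.normal(3.0, 2.0, 1000))   # shifted/scaled "noisy" features
```

After equalization the shifted, rescaled "noisy" features roughly recover the training distribution, which is the point of HEQ-style normalization.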

Robust Distributed Speech Recognition Using Histogram Equalization and Correlation Information
Pedro Manuel Martínez Jiménez, University of Granada, Spain
Jose Carlos Segura Luna, University of Granada, Spain
Luz García Martínez, University of Granada, Spain

In this paper, we propose a noise compensation method for robust speech recognition in DSR (Distributed Speech Recognition) systems based on histogram equalization and correlation information. The objective of this method is to exploit the correlation between components of the feature vector and the temporal correlation between consecutive frames of each component. The recognition experiments, including results in the Aurora 2, Aurora 3-Spanish and Aurora 3-Italian databases, demonstrate that the use of this correlation information increases the recognition accuracy.

Predictive Minimum Bayes Risk Classification for Robust Speech Recognition
Jen-Tzung Chien, Department of Computer Science and Information Engineering, National Cheng Kung University
Koichi Shinoda, Department of Computer Science, Tokyo Institute of Technology
Sadaoki Furui, Department of Computer Science, Tokyo Institute of Technology

This paper presents a new Bayes classification rule aimed at minimizing the predictive Bayes risk for robust speech recognition. Conventionally, maximum a posteriori (MAP) classification is constructed by adopting a nonparametric loss function and deterministic model parameters. Recognition performance is limited by environmental mismatch and the ill-posed model. In this study, we develop predictive minimum Bayes risk (PMBR) classification, where predictive distributions are inherent in the Bayes risk. More specifically, we exploit the Bayes loss function and the predictive word posterior probability for Bayes classification. Model mismatch and randomness are compensated to improve generalization capability in speech recognition. In the experiments, we estimate the prior densities of HMM parameters from adaptation data. With prior knowledge of the new environment and model uncertainty, PMBR is shown to outperform MAP, MBR and Bayesian predictive classification.

Applying Word Duration Constraints By Using Unrolled HMMs
Ning Ma, Speech and Hearing Research Group, Department of Computer Science, University of Sheffield, UK
Jon Barker, Speech and Hearing Research Group, Department of Computer Science, University of Sheffield, UK
Phil Green, Speech and Hearing Research Group, Department of Computer Science, University of Sheffield, UK

Conventional HMMs have weak duration constraints. In noisy conditions, the mismatch between corrupted speech signals and models trained on clean speech may cause the decoder to produce word matches with unrealistic durations. This paper presents a simple way to incorporate word duration constraints by unrolling HMMs to form a lattice where word duration probabilities can be applied directly to state transitions. The expanded HMMs are compatible with conventional Viterbi decoding. Experiments on connected-digit recognition show that when using explicit duration constraints the decoder generates word matches with more reasonable durations, and word error rates are significantly reduced across a broad range of noise conditions.
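The effect of explicit word-duration probabilities on decoding can be sketched with a small dynamic program: each candidate word token is scored by its frame log-scores plus a duration log-probability, which is the role played by the unrolled-HMM lattice in the paper (frame scores, the duration prior, and the single-word-type setup are all illustrative):

```python
import numpy as np

def decode_with_durations(frame_logp, dur_logp):
    """Segment T frames into word tokens, scoring each token by its frame
    log-scores plus an explicit word-duration log-probability."""
    T = len(frame_logp)
    max_d = len(dur_logp)
    best = np.full(T + 1, -np.inf)   # best[t]: best score over frames 0..t-1
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)
    for t in range(1, T + 1):
        for d in range(1, min(max_d, t) + 1):
            s = best[t - d] + frame_logp[t - d:t].sum() + dur_logp[d - 1]
            if s > best[t]:
                best[t], back[t] = s, d
    durs, t = [], T
    while t > 0:                     # trace back the chosen word durations
        durs.append(back[t])
        t -= back[t]
    return durs[::-1]
```

With uniform frame scores, the decoder picks the segmentation whose durations best match the duration prior, illustrating how the constraint suppresses unrealistically short or long word matches.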

Evaluating the Temporal Structure Normalisation Technique on the Aurora-4 Task
Xiong Xiao, Nanyang Technological University
Eng Siong Chng, Nanyang Technological University
Haizhou Li, Institute for Infocomm Research

We evaluate temporal structure normalisation (TSN), a feature normalisation technique for robust speech recognition, on the large-vocabulary Aurora-4 task. The TSN technique operates by normalising the trend of each feature's power spectral density (PSD) function to a reference function using finite impulse response (FIR) filters. The features are the cepstral coefficients, and the normalisation procedure is performed on every cepstral channel of each utterance. Experimental results show that TSN reduces the average word error rate (WER) by 7.20% and 8.16% relative over the mean-variance normalisation (MVN) and histogram equalisation (HEQ) baselines, respectively. We further evaluate two other state-of-the-art temporal filters; among the three, the TSN filter performs best. Lastly, our results also demonstrate that fixed smoothing filters are less effective on the Aurora-4 task than on the Aurora-2 task.
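The PSD-matching step can be approximated with a simple frequency-sampling FIR design: set the desired magnitude response to the square root of the ratio between reference and current PSDs, then window the resulting prototype. This is a simplification of TSN for illustration only (tap count, windowing, and the single-shot PSD estimate are assumptions):

```python
import numpy as np

def tsn_filter(traj, ref_psd, ntaps=33):
    """Filter one cepstral-coefficient trajectory so that its PSD trend
    follows a reference PSD, via frequency-sampling FIR design."""
    n = 2 * (len(ref_psd) - 1)
    cur_psd = np.abs(np.fft.rfft(traj - traj.mean(), n)) ** 2 / len(traj)
    gain = np.sqrt(ref_psd / np.maximum(cur_psd, 1e-12))    # desired |H(f)|
    proto = np.fft.irfft(gain, n)                           # zero-phase prototype
    taps = np.roll(proto, ntaps // 2)[:ntaps] * np.hamming(ntaps)
    return np.convolve(traj, taps, mode="same")
```

In TSN proper this is applied per cepstral channel and per utterance, with the reference PSD estimated from training data.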

Two-Stage System for Robust Neutral/Lombard Speech Recognition
Hynek Boril, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic
Petr Fousek, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic
Harald Hoege, Siemens Corporate Technology, Munich, Germany

This paper focuses on the design of a two-stage recognition system (TSR) comprising a talking-style classifier (neutral/Lombard speech) followed by two style-dedicated recognizers differing in their input features. First, the binary neutral/LE classifier is built, with particular attention to developing suitable features for the classification. Second, the performance of common speech features (MFCC, PLP), LE-robust features (Expolog) and newly proposed features is compared on neutral/LE digit recognition tasks. In addition, robustness to changes in average speech pitch and to various noise backgrounds is evaluated. Third, the TSR is built, employing two recognizers, each using style-specific features. Compared with either a neutral-specific or an LE-specific recognizer on joint neutral/LE speech, the proposed system reduces WER from 6.5% to 4.2% on neutral and from 48.1% to 28.4% on LE Czech utterances.

Noise Suppression Using Search Strategy with Multi-Model Compositions
Takatoshi Jitsuhiro, ATR Knowledge Science Laboratories
Tomoji Toriyama, ATR Knowledge Science Laboratories
Kiyoshi Kogure, ATR Knowledge Science Laboratories

We introduce a new noise suppression method that uses a search strategy with multi-model compositions, covering speech models, noise models, and their composites. Before noise suppression, a beam search is performed to find the best sequences of these models using noise acoustic models, noise-label n-gram models, and a noise-label lexicon. Noise suppression is then performed frame-synchronously with the multiple models selected by the search. We evaluated this method on the E-Nightingale task, which contains voice memoranda spoken by nurses during actual work at hospitals. On this difficult task, the proposed method obtained a 21.6% error reduction rate.

Investigations into Early and Late Reflections on Distant-talking Speech Recognition Toward Suitable Reverberation Criteria
Takanobu Nishiura, Ritsumeikan University
Yoshiki Hirano, Ritsumeikan University
Yuki Denda, Ritsumeikan University
Masato Nakayama, Ritsumeikan University

Reverberation-robust speech recognition has become very important for recognizing distant-talking speech. However, because no common reverberation criteria for reverberant-speech recognition have been proposed, recognition performance in reverberant conditions has been difficult to estimate. To overcome this problem, we investigated a suitable reverberation criterion based on the ISO3382 acoustic parameters for distant-talking speech recognition. We first measured distant-talking speech recognition performance with early and late reflections, based on the impulse response between the talker and the microphone. We found that early reflections within about 12.5 ms of the direct sound contribute slightly to distant-talking speech recognition in non-noisy environments. We then evaluated recognition performance against the ISO3382 acoustic parameters and confirmed that they are strong candidates for new reverberation criteria for distant-talking speech recognition.
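The early/late decomposition used in such studies amounts to splitting the measured room impulse response at a fixed delay after the direct sound and convolving speech with each part separately. A minimal sketch (taking the strongest tap as the direct sound is a simplifying assumption):

```python
import numpy as np

def split_rir(rir, fs, early_ms=12.5):
    """Split a room impulse response at `early_ms` after the direct sound
    (taken here as the strongest tap) into direct+early and late parts."""
    direct = np.argmax(np.abs(rir))
    cut = direct + int(early_ms * 1e-3 * fs)
    early, late = np.zeros_like(rir), np.zeros_like(rir)
    early[:cut] = rir[:cut]
    late[cut:] = rir[cut:]
    return early, late

# "Early-only" reverberant speech would then be np.convolve(speech, early).
```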

An Approach to Iterative Speech Feature Enhancement and Recognition
Stefan Windmann, Department of Communication Engineering, University of Paderborn
Reinhold Haeb-Umbach, Department of Communication Engineering, University of Paderborn

In this paper we propose a novel iterative speech feature enhancement and recognition architecture for noisy speech recognition. It consists of model-based feature enhancement employing Switching Linear Dynamical Models (SLDM), a hidden Markov model (HMM) decoder, and a state mapper that maps HMM states to SLDM states. To adhere consistently to a Bayesian paradigm, posteriors are exchanged between these processing blocks. By introducing feedback from the recognizer to the enhancement stage, enhancement can exploit both the SLDM's ability to model short-term dependencies and the HMM's ability to model long-term dependencies in the speech data. Experiments conducted on the Aurora II database demonstrate that significant word accuracy improvements are obtained at low signal-to-noise ratios.

Optimization of Temporal Filters in the Modulation Frequency Domain for Constructing Robust Features in Speech Recognition
Jeih-weih Hung, Dept. of EE, National Chi Nan University, Taiwan

Data-driven temporal filtering approaches based on specific optimization techniques have been shown to enhance the discrimination and robustness of speech features in speech recognition. In this paper, we derive new data-driven temporal filters that employ the statistics of the modulation spectra of the speech features. The new temporal filtering approaches are based on constrained versions of Principal Component Analysis (C-PCA) and Maximum Class Distance (C-MCD), respectively. The proposed C-PCA and C-MCD temporal filters are shown to effectively improve speech recognition accuracy in various noise-corrupted environments. In experiments conducted on Test Set A of the Aurora-2 noisy digits database, these new temporal filters, together with cepstral mean and variance normalization (CMVN), provide average relative error reduction rates of over 40% and 27% when compared with baseline MFCC processing and with CMVN alone, respectively.
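The unconstrained starting point of such data-driven filters is easy to sketch: collect overlapping windows of feature trajectories and take the first principal component as the filter taps. Plain PCA is used below; the paper's constrained variants (C-PCA/C-MCD) add conditions this sketch omits:

```python
import numpy as np

def pca_temporal_filter(trajs, taps=11):
    """Derive a data-driven temporal filter as the first principal component
    of overlapping windows of feature trajectories (plain PCA; the paper
    uses constrained variants)."""
    wins = [x[t:t + taps] for x in trajs for t in range(len(x) - taps + 1)]
    W = np.array(wins)
    W = W - W.mean(axis=0)
    cov = W.T @ W / len(W)
    vals, vecs = np.linalg.eigh(cov)
    return vecs[:, -1]               # eigenvector of the largest eigenvalue

# Apply along time: filtered = np.convolve(traj, h, mode="same")
```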

The Harming Part of Room Acoustics in Automatic Speech Recognition
Rico Petrick, Dresden University of Technology, Laboratory of Acoustics and Speech Communication
Kevin Lohde, Dresden University of Applied Sciences, Department of Electrical Engineering
Matthias Wolff, Dresden University of Technology, Laboratory of Acoustics and Speech Communication
Ruediger Hoffmann, Dresden University of Technology, Laboratory of Acoustics and Speech Communication

Automatic speech recognition (ASR) systems used in real indoor scenarios suffer from noise and reverberation conditions that differ from the training conditions. This article describes a study that aims to find out which parts of reverberation are most harmful to speech recognition; noise influences are left out. To this end, real room impulse responses for different rooms and different speaker-to-microphone distances are measured and modified. The results of recognition experiments with speech convolved with these impulse responses clearly show the dependence of recognition performance on early versus late reflections as well as on high- versus low-frequency reflections. Conclusions concerning the design of a dereverberation method are drawn.

A Reference Model Weighting-based Method for Robust Speech Recognition
Yuan-Fu Liao, National Taipei University of Technology, Taipei, Taiwan
Jyh-Her Yang, National Taipei University of Technology, Taipei, Taiwan
Chi-Hui Hsu, National Taipei University of Technology, Taipei, Taiwan
Cheng-Chang Lee, National Taipei University of Technology, Taipei, Taiwan
Jing-Teng Zeng, National Taipei University of Technology, Taipei, Taiwan

In this paper a reference model weighting (RMW) method is proposed for fast hidden Markov model (HMM) adaptation, which aims to use only one input test utterance to estimate online the characteristics of the unknown noisy test environment. The idea of RMW is to first collect a set of reference HMMs in the training phase to represent the space of noisy environments, and then synthesize a suitable HMM for the unknown noisy test environment by interpolating the set of reference HMMs. Noisy-environment mismatch can hence be efficiently compensated. The proposed method was evaluated on the multi-condition training task of the Aurora2 corpus. Experimental results showed that the proposed RMW approach outperformed both the histogram equalization (HEQ) method and the method proposed in the European Telecommunications Standards Institute (ETSI) distributed speech recognition (DSR) standard ES 202 212.
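The interpolation idea can be sketched with mean super-vectors: synthesize the test-environment means as a weighted combination of reference means. Least squares stands in here for the paper's utterance-level ML weight estimation, and the tiny vectors are illustrative:

```python
import numpy as np

def rmw_interpolate(ref_means, target_stats):
    """Synthesize a test-environment mean super-vector as a weighted
    combination of reference super-vectors; least squares stands in
    for the utterance-level ML weight estimation."""
    R = np.stack(ref_means, axis=1)                  # (dim, n_refs)
    w, *_ = np.linalg.lstsq(R, target_stats, rcond=None)
    return R @ w, w
```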

Mel Sub-Band Filtering and Compression for Robust Speech Recognition
Babak Nasersharif, Iran University of Science and Technology
Ahmad Akbari, Iran University of Science and Technology
Mohammad Mehdi Homayounpour, Amirkabir University of Technology

Mel-frequency cepstral coefficients (MFCC) are commonly used in speech recognition systems, but they are highly sensitive to the presence of external noise. In this paper, we propose a noise compensation method for Mel filter-bank energies, and hence for MFCC features. The compensation is performed in two stages: Mel sub-band filtering, followed by compression of the Mel sub-band energies. For the compression step, we propose a sub-band SNR-dependent compression function, which we use in place of the logarithm in conventional MFCC feature extraction in the presence of additive noise. Results show that the proposed method significantly improves the performance of MFCC features in noisy conditions, reducing the average word error rate by up to 30% for isolated word recognition on three test sets of the Aurora 2 database.
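To make the idea of an SNR-dependent compression concrete, here is a hypothetical function of the kind described: a root compression whose exponent shrinks as the sub-band SNR drops. The exact functional form, constants, and sigmoid gating below are assumptions for illustration and differ from the paper's function:

```python
import numpy as np

def compress(E, snr_db, alpha_min=0.05, alpha_max=1.0 / 3.0):
    """Hypothetical SNR-dependent compression replacing log(E) in MFCC
    extraction: low-SNR sub-bands are compressed harder (smaller root
    exponent) than high-SNR ones."""
    w = 1.0 / (1.0 + np.exp(-snr_db / 5.0))     # 0 (very noisy) .. 1 (clean)
    alpha = alpha_min + (alpha_max - alpha_min) * w
    return E ** alpha
```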
