Interspeech 2007 Session WeC.O1: Robust ASR against noise and reverberation
Type
oral
Date
Wednesday, August 29, 2007
Time
13:30 – 15:30
Room
Elisabeth
Chair
Phil Green (Speech and Hearing Research Group, Sheffield)
WeC.O1‑1
13:30
Vector-Quantization based Mask Estimation for Missing Data Automatic Speech Recognition
Maarten Van Segbroeck, Katholieke Universiteit Leuven - Dept. ESAT
Hugo Van hamme, Katholieke Universiteit Leuven - Dept. ESAT
The application of Missing Data Theory (MDT) has been shown to improve the robustness of automatic speech recognition (ASR) systems. A crucial part of an MDT-based recognizer is the computation of reliability masks from the noisy data. To estimate accurate masks in environments with unknown, non-stationary noise statistics, only weak assumptions can be made about the noise, so we must rely on a strong model of the speech. In this paper, we present a missing data detector that uses harmonicity in the noisy input signal and a vector quantizer (VQ) to confine the speech models to a subspace. The resulting system can deal with additive and convolutional noise and shows promising results on the Aurora4 large vocabulary database.
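As a rough illustration of the mask-estimation idea (not the authors' exact algorithm, which combines harmonicity cues with a VQ-constrained speech model), the sketch below flags a time-frequency cell as unreliable when the noisy energy exceeds what the nearest clean-speech codeword can explain. The codebook, feature space, and margin are all assumptions.

```python
import numpy as np

def vq_reliability_mask(noisy_logspec, codebook, margin_db=3.0):
    """Hypothetical VQ-based mask: for each noisy frame, find the nearest
    clean-speech codeword and mark channels whose noisy log-energy exceeds
    the codeword by more than margin_db as unreliable (noise-dominated)."""
    mask = np.empty(noisy_logspec.shape, dtype=bool)
    for t, frame in enumerate(noisy_logspec):
        # Nearest codeword under a Euclidean metric; the paper's search is
        # richer and also exploits harmonicity in the noisy signal.
        k = np.argmin(np.sum((codebook - frame) ** 2, axis=1))
        mask[t] = frame <= codebook[k] + margin_db
    return mask  # True = reliable, usable by an MDT decoder
```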
WeC.O1‑2
13:50
Accurate marginalization range for missing data recognition
Sébastien Demange, LORIA-UMR 7503
Christophe Cerisara, LORIA-UMR 7503
Jean-Paul Haton, LORIA-UMR 7503
Missing data recognition has been proposed to increase the noise robustness of automatic speech recognition. This strategy relies on a spectrographic mask that carries information about the true clean speech energy in a corrupted signal. This information is then used to refine how the observations are processed during decoding. In this work, we propose a new mask that provides more information about the clean speech contribution than classical masks based on Signal-to-Noise Ratio (SNR) thresholding. The proposed mask is described and compared with another missing data approach based on SNR thresholding. Experimental results show a significant word error rate reduction with the proposed approach. Moreover, the proposed mask outperforms the ETSI advanced front-end on the HIWIRE corpus.
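For reference, the classical SNR-thresholding baseline that the proposed mask is compared against can be sketched as follows; the spectral-subtraction noise estimate and the 0 dB threshold are illustrative assumptions.

```python
import numpy as np

def snr_threshold_mask(noisy_power, noise_power, threshold_db=0.0):
    """Classical binary mask: a time-frequency cell is 'reliable' when its
    estimated local SNR exceeds threshold_db. noisy_power is a (T, F)
    power spectrogram; noise_power is a matching (T, F) or (F,) estimate."""
    clean_est = np.maximum(noisy_power - noise_power, 1e-12)  # spectral subtraction
    snr_db = 10.0 * np.log10(clean_est / np.maximum(noise_power, 1e-12))
    return snr_db > threshold_db
```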
WeC.O1‑3
14:10
Smooth soft mel-spectrographic masks based on blind sparse source separation
Marco Kühne, The University of Western Australia
Roberto Togneri, The University of Western Australia
Sven Nordholm, Western Australian Telecommunications Research Institute
This paper investigates the use of DUET, a recently proposed blind source separation method, as a front-end for missing data speech recognition. Based on attenuation and delay estimates from stereo signals, soft time-frequency masks are designed to extract a target speaker from a mixture containing multiple speech sources. A post-processing step is introduced to remove isolated mask points that can cause insertion errors in the speech decoder. Results for connected digit experiments in a multi-speaker environment demonstrate that the proposed soft masks closely match the performance of the oracle mask designed with a priori knowledge of the source spectra.
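The post-processing step can be illustrated with a simple neighbour-count rule over the soft mask; the 0.5 activity threshold and the two-neighbour minimum are illustrative choices, not the paper's exact criterion.

```python
import numpy as np
from scipy.ndimage import convolve

def remove_isolated_points(soft_mask, active_thresh=0.5, min_neighbors=2):
    """Zero out 'active' time-frequency cells that have too few active
    neighbours, since isolated mask points tend to trigger insertion
    errors in the speech decoder."""
    active = soft_mask > active_thresh
    kernel = np.ones((3, 3))
    kernel[1, 1] = 0.0  # do not count the cell itself
    neighbors = convolve(active.astype(float), kernel, mode="constant")
    keep = ~active | (neighbors >= min_neighbors)
    return np.where(keep, soft_mask, 0.0)
```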
WeC.O1‑4
14:30
Model-driven detection of clean speech patches in noise
Jonathan Laidler, Dept. of Computer Science, University of Sheffield
Martin Cooke, Dept. of Computer Science, University of Sheffield
Neil Lawrence, School of Computer Science, University of Manchester
Listeners may be able to recognise speech in adverse conditions by "glimpsing" time-frequency regions where the target speech is dominant. Previous computational attempts to identify such regions have been source-driven, using primitive cues. This paper describes a model-driven approach in which the likelihood that a spectro-temporal patch of a noisy mixture represents speech is given by a generative model. The focus is on patch size and patch modelling. Small patches lead to a lack of discrimination, while large patches are more likely to contain contributions from other sources. A "cleanness" measure reveals that a good patch size is one which extends over a quarter of the speech frequency range and lasts for 40 ms. Gaussian mixture models are used to represent patches. A compact representation based on a 2D discrete cosine transform leads to reasonable speech/background discrimination.
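The compact patch representation can be sketched as a separable 2D DCT that keeps only the low-order coefficients; the number of retained coefficients below is illustrative, not the paper's setting.

```python
import numpy as np
from scipy.fftpack import dct

def patch_dct_features(patch, n_coef=3):
    """2D DCT of a spectro-temporal patch (frequency x time), truncated to
    the low-order coefficients as a compact feature vector. Such vectors
    would then be modelled with a Gaussian mixture model."""
    coef = dct(dct(patch, norm="ortho", axis=0), norm="ortho", axis=1)
    return coef[:n_coef, :n_coef].ravel()
```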
WeC.O1‑5
14:50
"Polyaural" Array Processing for Automatic Speech Recognition in Degraded Environments
Richard Stern, Carnegie Mellon University
Evandro Gouvea, Carnegie Mellon University
Govindarajan Thattai, Carnegie Mellon University
In this paper, we present a new method of signal processing for robust speech recognition using multiple microphones. The method, loosely based on the human binaural hearing system, passes the speech signals detected by multiple microphones through bandpass filtering and nonlinear half-wave rectification operations, and then cross-correlates the outputs from each channel within each frequency band. These operations provide rejection of off-axis interfering signals. They are repeated (in a non-physiological fashion) for the negated signal, and an estimate of the desired signal is obtained by combining the positive and negative outputs. We demonstrate that this approach provides substantially better recognition accuracy than delay-and-sum beamforming with the same sensors for target signals in the presence of additive broadband and speech maskers. Improvements in reverberant environments are tangible but more modest.
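A minimal single-band sketch of the described pipeline appears below; the filter design, the zero-lag product as the cross-correlation step, and the positive-minus-negative combination are all assumptions about details the abstract leaves open.

```python
import numpy as np
from scipy.signal import butter, lfilter

def polyaural_band(signals, fs, f_lo, f_hi):
    """One frequency band of a polyaural-style processor: bandpass each
    microphone signal, half-wave rectify, multiply across channels (a
    zero-lag cross-correlation), repeat on the negated signals, and
    combine. Off-axis interferers decorrelate across channels and are
    suppressed; the on-axis target survives in both branches."""
    b, a = butter(4, [f_lo, f_hi], btype="band", fs=fs)
    band = np.array([lfilter(b, a, s) for s in signals])  # (mics, samples)
    pos = np.prod(np.maximum(band, 0.0), axis=0)   # positive half-waves
    neg = np.prod(np.maximum(-band, 0.0), axis=0)  # negated-signal pass
    return pos - neg  # crude estimate of the target in this band
```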
WeC.O1‑6
15:10
Adding Noise to Improve Noise Robustness in Speech Recognition
Nicolas Morales, HCTLab, Universidad Autónoma de Madrid, Spain.
Liang Gu, IBM T. J. Watson Research Center, Yorktown Heights, USA.
Yuqing Gao, IBM T. J. Watson Research Center, Yorktown Heights, USA.
In this work, we explore a technique for increasing recognition accuracy on speech affected by corrupting noise of an undetermined nature: the addition of a known and well-behaved noise (masking noise). The same type of noise used for masking is added to the training data, thus reducing the gap between training and test conditions regardless of the type of corrupting noise or whether it is stationary. While still at an early stage of development, the new approach shows consistent improvements in accuracy and robustness across a variety of conditions, with no use of a priori knowledge of the corrupting noise. The approach is shown to be of particular interest for cross-talk corrupting noise, a complicated situation in speech recognition for which the relative gain with the proposed approach is over 24%.
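The core manipulation (adding the same well-behaved masking noise to both training and test audio) can be sketched as follows; white Gaussian noise at a 10 dB SNR is an illustrative choice, not necessarily the noise type or level used in the paper.

```python
import numpy as np

def add_masking_noise(signal, snr_db=10.0, rng=None):
    """Add white Gaussian masking noise at a target SNR. Applying the same
    call to training and test data narrows the train/test mismatch
    regardless of the (unknown) corrupting noise already present."""
    rng = np.random.default_rng() if rng is None else rng
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
```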