Noise-Robust Hands-free Voice Activity Detection With Adaptive Zero Crossing Detection using Talker Direction Estimation
Yuki Denda, Ritsumeikan University
Takamasa Tanaka, Ritsumeikan University
Masato Nakayama, Ritsumeikan University
Takanobu Nishiura, Ritsumeikan University
Yoichi Yamashita, Ritsumeikan University
This paper proposes a novel hands-free voice activity detection (VAD) method, called adaptive zero crossing detection (AZCD), that uses talker direction estimation to exploit spatial as well as temporal features. It first estimates the talker direction to extract two spatial features, spatial reliability and spatial variance, based on weighted cross-power spectrum phase analysis and maximum likelihood estimation. The AZCD then detects voice activity frames in noisy environments by robustly detecting the zero crossing information of speech, with thresholds adaptively controlled by the extracted spatial features. Experimental results in an actual office room confirmed that the VAD performance of the proposed method, which utilizes both temporal and spatial features, is superior to that of conventional methods that utilize only temporal or only spatial features.
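As a rough illustration of the temporal cue named in this abstract, the sketch below counts zero crossings in a frame while ignoring samples whose magnitude is below an amplitude level. It is not the authors' AZCD: the function names, the fixed `level` and `min_crossings` parameters, and the simple "enough crossings means voice" decision are all assumptions made for illustration. In the proposed method the thresholds are controlled adaptively from the spatial features, which this sketch does not attempt.

```python
def count_zero_crossings(frame, level):
    """Count sign changes between samples whose magnitude exceeds `level`.

    Samples with |x| <= level are skipped, so low-amplitude noise
    hovering around zero does not register as crossings.
    """
    count = 0
    last_sign = 0  # sign of the most recent sample above the level
    for x in frame:
        if abs(x) <= level:
            continue
        sign = 1 if x > 0 else -1
        if last_sign != 0 and sign != last_sign:
            count += 1
        last_sign = sign
    return count


def is_voice_frame(frame, level, min_crossings):
    """Hypothetical frame decision: enough level-gated crossings."""
    return count_zero_crossings(frame, level) >= min_crossings
```

Raising `level` makes the count less sensitive to weak background noise, which is the kind of threshold one could imagine adapting per frame.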
A Robust Mel-Scale Subband Voice Activity Detector for a Car Platform
Agustín Álvarez, Universidad Politécnica de Madrid
Rafael Martínez, Universidad Politécnica de Madrid
Pedro Gómez, Universidad Politécnica de Madrid
Víctor Nieto, Universidad Politécnica de Madrid
Victoria Rodellar, Universidad Politécnica de Madrid
Voice-controlled devices provide a smart solution to operate add-on appliances in a car. In most cases, noise reduction techniques involving a Voice Activity Detector (VAD) are required. This paper proposes a robust method for speech detection under the influence of noise and reverberation in an automobile environment. The method achieves a consistent speech/non-speech discrimination by means of a set of Order-Statistics Filters (OSFs) applied to the log-energies associated with a mel-scale based subband division. The paper also includes an extensive performance evaluation of the algorithm using AURORA3 database recordings. According to our simulation results, the proposed algorithm shows on average significantly better performance than standard VADs such as ITU-G.729B, GSM-AMR or ETSI-AFE, and other recently reported methods.
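To make the order-statistics idea concrete, here is a minimal generic sketch of an order-statistics filter applied to a sequence of per-frame log-energies for one subband. The function name, the window size and the quantile are assumptions chosen for illustration; the abstract does not state the configuration used by the authors.

```python
def order_statistics_filter(values, half_window, quantile):
    """Replace each value by a quantile of its neighbourhood.

    values      : per-frame log-energies of one subband
    half_window : number of neighbouring frames taken on each side
    quantile    : 0.0 = minimum, 0.5 = median, 1.0 = maximum
    """
    smoothed = []
    n = len(values)
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = sorted(values[lo:hi])
        # Index of the requested order statistic in the sorted window.
        idx = int(quantile * (len(window) - 1) + 0.5)
        smoothed.append(window[idx])
    return smoothed
```

With `quantile=0.5` this is a median filter, which suppresses an isolated energy spike in a single frame while leaving sustained changes in place.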
Noise Robust Front-end Processing with Voice Activity Detection based on Periodic to Aperiodic Component Ratio
Kentaro Ishizuka, NTT Communication Science Laboratories
Tomohiro Nakatani, NTT Communication Science Laboratories
Masakiyo Fujimoto, NTT Communication Science Laboratories
Noboru Miyazaki, NTT Cyber Space Laboratories
This paper proposes a front-end processing method for automatic speech recognition (ASR) that employs a voice activity detection (VAD) method based on the periodic to aperiodic component ratio (PAR), called PARADE (PAR based Activity DEtection). By considering the powers of the periodic and aperiodic components of the observed signals simultaneously, PARADE can detect speech segments more precisely than conventional VAD methods. PARADE can be applied to a front-end processing technique that employs a robust feature extraction method called SPADE (Subband based Periodicity and Aperiodicity DEcomposition). The noisy ASR performance was examined with the CENSREC-1-C database, which includes connected continuous digit speech utterances drawn from the Japanese version of AURORA-2. The SPADE front-end combined with PARADE achieves an average word accuracy of 74.22 % at SNRs of 0 to 20 dB. This accuracy is significantly higher than that achieved by the ETSI ES 202 050 front-end (63.66 %).
Feature and Distribution Normalization Schemes for Statistical Mismatch Reduction in Reverberant Speech Recognition
Aik Ming Toh, University of Western Australia
Roberto Togneri, University of Western Australia
Sven Nordholm, Western Australian Telecommunications Research Institute
Reverberant noise has been a major concern in speech recognition systems. Many speech recognition systems, even with state-of-the-art features, fail to cope with reverberant effects, and the recognition rate deteriorates. This paper explores the significance of normalization strategies in reducing statistical mismatches for robust speech recognition in reverberant environments. Previous work on normalization focused only on ambient noise and has yet to be tested on reverberant noise. In addition, we propose a new approach for the odd order cepstral moment normalization which is computationally more efficient and reduces the convergence rate in the algorithm. The proposed method is experimentally justified and corroborated by the performance of other normalization schemes. The results emphasize the significance of reducing statistical mismatches in feature space for reverberant speech recognition.
Temporal Masking for Unsupervised Minimum Bayes Risk Speaker Adaptation
Matthew Gibson, Sheffield University
Thomas Hain, Sheffield University
The minimum Bayes risk (MBR) criterion has previously been applied to the task of speaker adaptation in large vocabulary continuous speech recognition. The success of unsupervised MBR speaker adaptation, however, has been limited by the accuracy of the estimated transcription of the acoustic data. This paper addresses this issue not by improving the accuracy of the estimated transcription but via temporal masking of its erroneous regions.
Speech Feature Compensation Based on Pseudo Stereo Codebooks for Robust Speech Recognition in Additive Noise Environments
Tsung-hsueh Hsieh, Dept. of Electrical Engineering, National Chi Nan University, Taiwan
Jeih-weih Hung, Dept. of Electrical Engineering, National Chi Nan University, Taiwan
In this paper, we propose several compensation approaches to alleviate the effect of additive noise on speech features for speech recognition. These approaches are simple yet efficient noise reduction techniques that use online constructed pseudo stereo codebooks to evaluate the statistics in both clean and noisy environments. The process yields transforms for noise-corrupted speech features to make them closer to their clean counterparts. We apply these compensation approaches to various well-known speech features, including MFCC, AMFCC and PLPCC. Experiments conducted on the Aurora-2 database show that the proposed approaches provide all types of the features with a significant performance gain when compared to the baseline results and those obtained by using the conventional utterance-based CMVN.
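For readers unfamiliar with the comparison point named in this abstract, the following is a minimal sketch of utterance-based cepstral mean and variance normalization (CMVN): each feature dimension is shifted to zero mean and scaled to unit variance over one utterance. The function name and the list-of-frames data layout are assumptions for illustration, and this shows only the conventional baseline, not the proposed pseudo stereo codebook methods.

```python
import math


def utterance_cmvn(features):
    """Normalize each feature dimension over one utterance.

    features : list of frames, each a list of coefficients
    returns  : new list of frames with zero mean and unit variance
               per dimension (constant dimensions are only centred)
    """
    n = len(features)
    dims = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((f[d] - means[d]) ** 2 for f in features) / n
        # Fall back to 1.0 so a constant dimension does not divide by zero.
        stds.append(math.sqrt(var) or 1.0)
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)]
            for f in features]
```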
Multiband, Multisensor Robust Features for Noisy Speech Recognition
Dimitrios Dimitriadis, National Technical University of Athens, School of ECE
Petros Maragos, National Technical University of Athens, School of ECE
Stamatios Lefkimmiatis, National Technical University of Athens, School of ECE
This paper presents a novel feature extraction scheme taking advantage of both the nonlinear modulation speech model and the spatial diversity of speech and noise signals in a multisensor environment. Herein, we propose applying robust features to speech signals captured by a multisensor array, minimizing a noise energy criterion over multiple frequency bands. We show that we can achieve improved recognition performance by minimizing the Teager-Kaiser energy of the noise-corrupted signals in different frequency bands. These Multiband, Multisensor Cepstral (MBSC) features are inspired by similar ones that have already been applied to single-microphone noisy speech recognition tasks with significantly improved results. The recognition results show that the proposed features can perform better than the widely-used MFCC features.
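The Teager-Kaiser energy mentioned in this abstract is commonly computed in discrete time as Ψ[x(n)] = x(n)² − x(n−1)·x(n+1). The sketch below implements only that standard operator on a single signal; the function name is an assumption, and the multiband filtering and multisensor selection described in the abstract are not reproduced here.

```python
def teager_kaiser_energy(x):
    """Discrete Teager-Kaiser energy operator.

    Returns psi[n] = x[n]^2 - x[n-1] * x[n+1] for every interior
    sample, so the output is two samples shorter than the input.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1]
            for n in range(1, len(x) - 1)]
```

For a sampled sinusoid A·cos(ωn) the operator yields the constant A²·sin²(ω), which is why it is often used as a joint amplitude and frequency energy measure.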
Noise robust speech recognition for voice driven wheelchair
Akira Sasou, National Institute of Advanced Industrial Science and Technology (AIST)
Hiroaki Kojima, National Institute of Advanced Industrial Science and Technology (AIST)
In this paper, we introduce a noise robust speech recognition system for a voice-driven wheelchair. Our system adopts a microphone array so that the user does not need to wear a microphone. By mounting the microphone array on the wheelchair, our system can easily distinguish the user’s utterances from other voices without using a speaker identification technique. We have also adopted a feature compensation technique. By combining the microphone array and the feature compensation technique, our system can be applied to various noise environments. This is because the microphone array can provide reliable voice activity detection information to the feature compensation method, and the feature compensation method can compensate for the weak point of the microphone array, namely that it tends to be less effective against omni-directional noise.