Interspeech 2007
August 27-31, 2007
Antwerp, Belgium

Interspeech 2007 Session FrC.P3: Voice activity detection and sound classification

Type poster
Date Friday, August 31, 2007
Time 13:30 – 15:30
Room Keurvels
Chair Christian Wellekens (Eurecom, Sophia Antipolis)


Speech - Nonspeech Discrimination using the Information Bottleneck Method and Spectro-Temporal Modulation Index
Maria Markaki, Multimedia Informatics Lab, Computer Science Department, University of Crete
Michael Wohlmayr, Multimedia Informatics Lab, Computer Science Department, University of Crete
Yannis Stylianou, Multimedia Informatics Lab, Computer Science Department, University of Crete

In this work, we adopt an information theoretic approach, the Information Bottleneck method, to extract the relevant spectro-temporal modulations for the task of speech/non-speech discrimination, where non-speech events include music, noise and animal vocalizations. A compact representation (a "cluster prototype") is built for each class, consisting of the maximally informative features with respect to the classification task. We assess the similarity of a sound to each representative cluster using the spectro-temporal modulation index (STMI), adapted to handle the contribution of different frequency bands. A simple threshold check then discriminates speech from non-speech events. Experiments show that the proposed method has low complexity and high discrimination accuracy in low SNR conditions compared to recently proposed methods for the same task.

A Uniformly Most Powerful Test for Statistical Model-Based Voice Activity Detection
Keun Won Jang, Department of Electronics and Computer Engineering, Chonnam National University, Gwangju, Korea
Dong Kook Kim, Department of Electronics and Computer Engineering, Chonnam National University, Gwangju, Korea
Joon-Hyuk Chang, Department of Electronic Engineering, Inha University, Incheon, Korea

This paper presents a new voice activity detection (VAD) method using the Gaussian distribution and a uniformly most powerful (UMP) test. The UMP test is employed to derive a new decision rule based on a likelihood ratio test (LRT). The method models the magnitude of the noisy spectral component with a Rayleigh distribution and uses an adaptive threshold estimated from the noise statistics. Experimental results show that the proposed UMP-test-based VAD algorithm outperforms the conventional scheme.
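
The kind of decision rule described above can be sketched as a per-frame log-likelihood ratio between two Rayleigh magnitude models (noise only vs. noise plus speech). The sketch below is illustrative only; the function names and the fixed threshold are our own, whereas the paper estimates its threshold adaptively from noise statistics:

```python
import numpy as np

def rayleigh_llr(mag, noise_var, speech_var):
    """Log-likelihood ratio of spectral magnitudes under two Rayleigh
    models: H1 (speech present, variance noise_var + speech_var) vs.
    H0 (noise only, variance noise_var)."""
    v0 = noise_var
    v1 = noise_var + speech_var
    return np.log(v0 / v1) + 0.5 * mag ** 2 * (1.0 / v0 - 1.0 / v1)

def vad_decision(mags, noise_var, speech_var, threshold=0.0):
    """Frame-level decision: mean LLR over frequency bins compared with
    a threshold (fixed here for illustration)."""
    return rayleigh_llr(mags, noise_var, speech_var).mean() > threshold
```

A frame whose magnitudes are large relative to the noise model yields a positive mean LLR and is declared speech.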

Direct optimisation of a multilayer perceptron for the estimation of cepstral mean and variance statistics
John Dines, IDIAP Research Institute
Jithendra Vepa, IDIAP Research Institute

We propose an alternative means of training a multilayer perceptron for speech activity detection, based on a criterion that minimises the error in the estimation of mean and variance statistics of speech cepstrum-based features using the Kullback-Leibler divergence. We present our baseline and proposed speech activity detection approaches for multi-channel meeting room recordings, and demonstrate the effectiveness of the new criterion by comparing the two approaches when used to carry out cepstral mean and variance normalisation of features in our meeting ASR system.
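
A divergence of the kind this criterion relies on has a closed form when the estimated and reference cepstral statistics are treated as diagonal Gaussians. A minimal sketch (the function name is ours, and this is not the authors' training code):

```python
import numpy as np

def kl_diag_gauss(mu1, var1, mu2, var2):
    """Closed-form KL divergence N(mu1, var1) || N(mu2, var2) for
    diagonal Gaussians, summed over cepstral dimensions."""
    return 0.5 * np.sum(np.log(var2 / var1)
                        + (var1 + (mu1 - mu2) ** 2) / var2
                        - 1.0)
```

The divergence is zero when the two sets of statistics match and grows as the estimated mean or variance drifts from the reference.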

Filtering the Unknown: Speech Activity Detection in Heterogeneous Video Collections
Marijn Huijbregts, University of Twente, International Computer Science Institute
Chuck Wooters, International Computer Science Institute
Roeland Ordelman, University of Twente

In this paper we discuss the speech activity detection system that we used to detect speech regions in the Dutch TRECVID video collection. The system is designed to filter non-speech, such as music or sound effects, out of the signal without using predefined non-speech models. Because the system trains its models online, it is robust to out-of-domain data. The speech activity error rate on an out-of-domain test set, recordings of English conference meetings, was 4.4%. The overall error rate on twelve randomly selected five-minute TRECVID fragments was 11.5%.

Environmentally Aware Voice Activity Detector
Abhijeet Sangwan, Center for Robust Speech Systems (CRSS)
Nitish Krishnamurthy, Center for Robust Speech Systems (CRSS)
John H.L. Hansen, Center for Robust Speech Systems (CRSS)

Traditional voice activity detectors (VADs) tend to be deaf to the acoustical background noise, as they utilize a single operating point for all SNRs and noise types, and attempt to learn the background noise model online from finite data. In this paper, we address these issues by designing an environmentally aware (EA) VAD. The EA VAD scheme builds prior offline knowledge of commonly encountered acoustical backgrounds, and combines the recently proposed competitive Neyman-Pearson (CNP) VAD with an SVM-based noise classifier. In operation, the EA VAD obtains accurate noise models of the acoustical background by employing the noise classifier and its prior knowledge of the noise type, and then uses this information to set the best operating point and initialization parameters for the CNP VAD. We report an absolute 10-15% improvement in detection rates of the EA VAD over AMR VADs at low SNR.

Noise Robust Voice Activity Detection Based on Switching Kalman Filter
Masakiyo Fujimoto, NTT Communication Science Laboratories, NTT Corporation
Kentaro Ishizuka, NTT Communication Science Laboratories, NTT Corporation

This paper addresses the problem of voice activity detection (VAD) in noisy environments. The proposed VAD method is based on a statistical model approach and estimates statistical models sequentially without a priori knowledge of the noise. Namely, the method constructs a clean speech / silence state transition model beforehand, and sequentially adapts the model to the noisy environment with a switching Kalman filter as the signal is observed. The evaluation is carried out using the CENSREC-1-C VAD evaluation framework. The results revealed that the proposed method significantly outperforms the CENSREC-1-C baseline in VAD accuracy in real environments.

Voice Activity Detection based on Support Vector Machine using Effective Feature Vectors
Q-Haing Jo, School of Electronic and Electrical Engineering, Inha University, Incheon, Korea
Yun-Sik Park, School of Electronic and Electrical Engineering, Inha University, Incheon, Korea
Kye-Hwan Lee, School of Electronic and Electrical Engineering, Inha University, Incheon, Korea
Ji-Hyun Song, School of Electronic and Electrical Engineering, Inha University, Incheon, Korea
Joon-Hyuk Chang, School of Electronic and Electrical Engineering, Inha University, Incheon, Korea

In this paper, we propose effective feature vectors to improve the performance of voice activity detection (VAD) employing a support vector machine (SVM), which provides an optimized nonlinear decision boundary between two classes. To extract the effective feature vectors, we present a novel scheme combining the a posteriori SNR, a priori SNR, and predicted SNR, widely adopted in conventional statistical model-based VAD. Based on the results of experiments, the performance of the SVM-based VAD using the novel feature vectors is found to be better than that of ITU-T G.729B and other recently reported methods.
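
The a posteriori and a priori SNR quantities named above are standard in statistical model-based VAD; a common way to compute them per frequency bin is the decision-directed estimate. The sketch below is in that spirit only (the function name and smoothing constant are our assumptions, not the paper's exact features):

```python
import numpy as np

def snr_features(noisy_power, noise_power, prev_clean_power, alpha=0.98):
    """Per-bin SNR features: a posteriori SNR, and a decision-directed
    a priori SNR estimate that smooths the previous frame's clean-speech
    power estimate with the current instantaneous SNR."""
    gamma = noisy_power / noise_power                      # a posteriori SNR
    xi = (alpha * prev_clean_power / noise_power
          + (1 - alpha) * np.maximum(gamma - 1.0, 0.0))    # a priori SNR
    return gamma, xi
```

Such per-bin (or band-averaged) SNR values are then stacked into the feature vector fed to the classifier.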

Voice Activity Detection in Degraded Speech Using Excitation Source Information
Sri Rama Murty K, Department of Computer Science and Engg., Indian Institute of Technology Madras, India
Yegnanarayana B, International Institute of Information Technology Hyderabad, India
Guruprasad S, Department of Computer Science and Engg., Indian Institute of Technology Madras, India

This paper proposes a method for detecting voiced regions in speech signals collected in noisy environments. The proposed method is based on the characteristics of the excitation source of speech production. The degraded speech signal is processed by linear prediction analysis to derive the linear prediction residual. The Hilbert envelope of the linear prediction residual is processed using covariance analysis to obtain a coherently-added covariance signal. The periodicity of the coherently-added covariance signal is exploited to detect the voiced regions using autocorrelation analysis. The performance of the proposed voice activity detection algorithm is evaluated under different noise environments and at different levels of degradation.
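
Two building blocks of such a pipeline, the Hilbert envelope and an autocorrelation-based periodicity check, can be sketched as follows. This omits the linear prediction analysis and the covariance stage, and the function names are ours:

```python
import numpy as np

def hilbert_envelope(x):
    """Hilbert envelope via the FFT-based analytic signal."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    return np.abs(np.fft.ifft(X * h))

def periodicity(env, min_lag, max_lag):
    """Peak of the normalized autocorrelation in a lag range; high
    values indicate periodic (voiced-like) excitation."""
    env = env - env.mean()
    ac = np.correlate(env, env, mode='full')[len(env) - 1:]
    ac = ac / ac[0]
    return ac[min_lag:max_lag].max()
```

An impulse-train-like excitation yields a strong autocorrelation peak at its period, while a noise-like excitation does not.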

Evaluation of Real-time Voice Activity Detection based on High Order Statistics
David Cournapeau, Graduate School of Informatics, Kyoto University, Japan
Tatsuya Kawahara, Graduate School of Informatics, Kyoto University, Japan

We have proposed a method for real-time, unsupervised voice activity detection (VAD). In this paper, the problems of feature selection and the classification scheme are addressed. The feature is based on High Order Statistics (HOS) to discriminate close and far-field talk, enhanced by a feature derived from the normalized autocorrelation. The comparative effectiveness of several HOS measures is shown. The classification is done in real-time with a recursive, online EM algorithm. The algorithm is evaluated on the CENSREC-1-C database, which is used for VAD evaluation for automatic speech recognition (ASR) [1], and the proposed method is confirmed to significantly outperform the baseline energy-based method.
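
A typical fourth-order statistic used in HOS-based VAD is the excess kurtosis, which separates super-Gaussian speech from near-Gaussian noise. A minimal sketch (the paper's exact HOS features may differ):

```python
import numpy as np

def excess_kurtosis(frame):
    """Excess kurtosis of a signal frame: positive for super-Gaussian
    (speech-like) samples, near zero for Gaussian noise."""
    x = frame - frame.mean()
    m2 = np.mean(x ** 2)
    m4 = np.mean(x ** 4)
    return m4 / m2 ** 2 - 3.0
```

Thresholding such a statistic per frame gives a simple noise-robust speech/non-speech cue.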

Robust Voice Activity Detection Based on Adaptive Sub-band Energy Sequence Analysis and Harmonic Detection
Yanmeng Guo, ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences
Qiang Fu, ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences
Yonghong Yan, ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences

Voice activity detection (VAD) in real-world noise is a very challenging task. In this paper, a two-step methodology is proposed to solve the problem. First, segments with non-stationary components, including speech and dynamic noise, are located using sub-band energy sequence analysis (SESA). Second, voice is detected within the selected segments using a proposed method based on harmonic structure. Speech segments can therefore be accurately detected by this rule-based framework. The algorithm is evaluated on several databases in terms of speech/non-speech discrimination and in terms of word accuracy rate when used as the front-end of an automatic speech recognition (ASR) system. It provides more reliable performance than commonly used standard methods.

The influence of speech activity detection and overlap on speaker diarization for meeting room recordings
Corinne Fredouille, LIA - University of Avignon
Nicholas Evans, LIA - University of Avignon, UWS - University of Wales Swansea

This paper addresses the problem of speaker diarization in the specific context of meeting room recordings, which often involve a high degree of spontaneous speech with large overlapped speech segments, speaker noise (laughs, whispers, coughs, etc.) and very short speaker turns. Large variability in signal quality adds a further level of complexity. This paper investigates the effects of speech activity detection and overlapped speech through speaker diarization experiments conducted on the NIST RT'05 and NIST RT'06 data sets. Results indicate that our system is highly sensitive to the shape of the initial segmentation and that, perhaps surprisingly, perfect references can even degrade performance. Finally we propose a direction for future research to incorporate confidence values according to acoustic attributes in order to unify what is currently a somewhat disjointed approach to speaker diarization.

Voice Activity Detection Using the Phase Vector in Microphone Array
Gibak Kim, Seoul National University
Nam Ik Cho, Seoul National University

If the desired speech source is located at a different position from the interference, it is possible to exploit spatial selectivity for reliable speech detection. In this paper, we propose a voice activity detector (VAD) for microphone array systems, using spatial information obtained from the eigendecomposition of the multi-channel correlation matrix. We use the phase vector, derived from the principal eigenvector, as a measure for VAD. Voice activity is detected by a log likelihood ratio test under the assumption that the phase vectors of speech-absent and speech-present signals follow complex Gaussian distributions. The proposed algorithm is tested on real data recorded by 8 microphones, and the results show that it performs better than a GSC-based method.
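
The phase-vector idea can be sketched briefly: the principal eigenvector of the spatial correlation matrix at a frequency bin carries the inter-channel phase of the dominant source. The function name and simulation below are ours, not the paper's implementation:

```python
import numpy as np

def phase_vector(frames):
    """frames: (channels, snapshots) complex spectral values at one
    frequency bin. Returns the phase of the principal eigenvector of
    the spatial correlation matrix, referenced to the first channel."""
    R = frames @ frames.conj().T / frames.shape[1]   # spatial correlation
    w, V = np.linalg.eigh(R)                         # Hermitian eigendecomposition
    v = V[:, -1]                                     # principal eigenvector
    return np.angle(v / v[0])                        # remove global phase
```

For a single dominant plane wave, this recovers the array's inter-channel phase delays, which is the spatial signature the detector tests against.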

Adaptive weighting of microphone arrays for distant-talking F0 and voiced/unvoiced estimation
Federico Flego, Radio Trevisan
Christian Zieger, FBK-irst
Maurizio Omologo, FBK-irst

This paper introduces a new multi-microphone processing technique that provides features for the extraction of fundamental frequency and for the classification of voiced/unvoiced segments in distant-talking speech. A multichannel periodicity function (MPF) is derived from an adaptive weighting of normalized and compressed magnitude spectra. This function highlights periodicity cues in the given speech signals, even under noisy and reverberant conditions. The resulting MPF features are then exploited for voiced/unvoiced classification based on hidden Markov models. Experiments, conducted both on simulated data and on real seminar recordings based on a network of reversed T-shaped arrays, showed the robustness of the proposed technique.

Robust and High-resolution Voiced/Unvoiced Classification in Noisy Speech Using A Signal Smoothness Criterion
Sreenivasa Murthy Anjanappa, Dept. of Electrical Comm. Engg., Indian Institute of Science, Bangalore
Chandra Sekhar Seelamantula, Ecole Polytechnique Federale de Lausanne, Biomedical Imaging Laboratory
Sreenivas Thippur V, Dept. of Electrical Comm. Engg., Indian Institute of Science, Bangalore

We propose a novel technique for robust voiced/unvoiced segment detection in noisy speech, based on local polynomial regression. The local polynomial model is well suited to voiced segments of speech, whereas unvoiced segments are noise-like and do not exhibit the same smooth structure. This smoothness property is used to devise a new metric, the variance ratio metric, which, after thresholding, indicates the voiced/unvoiced boundaries with 75% accuracy at 0 dB global signal-to-noise ratio (SNR). Another novelty of our algorithm is that it processes the signal sample by sample rather than frame by frame. Simulation results on the TIMIT speech database (sampling frequency: 8 kHz) for various SNRs are presented to illustrate the performance of the new algorithm.
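
A variance-ratio smoothness measure of this flavour can be sketched as the ratio of a local polynomial fit's residual variance to the raw local variance. The window length, polynomial degree, and function name below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def variance_ratio(x, win=41, deg=3):
    """Sample-wise smoothness measure: residual variance of a local
    polynomial fit divided by the raw local variance. Smooth
    (voiced-like) regions give values near 0; noise-like (unvoiced)
    regions give values near 1."""
    half = win // 2
    t = np.arange(win) - half          # centered abscissa for conditioning
    ratios = np.full(len(x), np.nan)   # edges left undefined
    for i in range(half, len(x) - half):
        seg = x[i - half:i + half + 1]
        coef = np.polyfit(t, seg, deg)
        resid = seg - np.polyval(coef, t)
        ratios[i] = resid.var() / seg.var()
    return ratios
```

Thresholding the ratio then marks the transitions between smooth and noise-like regions.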

Audio Classification using Extended Baum-Welch Transformations
Tara N Sainath, MIT
Victor Zue, MIT
Dimitri Kanevsky, IBM

Audio classification has applications in a variety of contexts, such as automatic sound analysis, supervised audio segmentation, and audio information search and retrieval. Extended Baum-Welch (EBW) transformations are most commonly used as a discriminative technique for estimating parameters of Gaussian mixtures, though recently they have been applied to unsupervised audio segmentation. In this paper, we extend the use of these transformations to derive an audio classification algorithm. We find that our method outperforms both the Support Vector Machine (SVM) and Gaussian Mixture Model (GMM) likelihood classification methods.

Automatic Laughter Detection Using Neural Networks
Mary Knox, International Computer Science Institute, Berkeley, California; Department of Electrical Engineering, University of California, Berkeley
Nikki Mirghafori, International Computer Science Institute, Berkeley, California

Laughter recognition is an underexplored area of research. Our goal in this work was to develop an accurate and efficient method to recognize laughter segments, ultimately for the purpose of speaker recognition. Previous work has classified presegmented data as to the presence of laughter using SVMs, GMMs, and HMMs. In this work, we have extended the state-of-the-art in laughter recognition by eliminating the need to presegment the data, while attaining high precision, as well as yielding higher resolution for labeling start and end times. In our experiments, we found neural networks to be a particularly good fit for this problem and the score level combination of the MFCC, AC PEAK, and F0 features to be optimal. We achieved an equal error rate (EER) of 7.9% for laughter recognition, thereby establishing the first results for non-presegmented frame-by-frame laughter recognition on the ICSI Meetings database.

Automatic Acoustic Segmentation for Speech Recognition on Broadcast Recordings
Gang Peng, Dept. of Electrical Engineering, Univ. of Washington, Seattle, WA 98195, USA
Mei-Yuh Hwang, Dept. of Electrical Engineering, Univ. of Washington, Seattle, WA 98195, USA
Mari Ostendorf, Dept. of Electrical Engineering, Univ. of Washington, Seattle, WA 98195, USA

This paper investigates the issue of automatic segmentation of speech recordings for broadcast news (BN) and broadcast conversation (BC) speech recognition. Our previous segmentation algorithm often exhibited high deletion errors, where some speech segments were misclassified as non-speech and thus were never passed on to the recognizer. In contrast with our previous segmentation models, which only differentiated between speech and non-speech segments, phonetic knowledge is applied to represent speech by using multiple models for different types of speech segments. Moreover, the “pronunciation” of the speech segment has been modified to loosen the minimum duration constraint. This method makes use of language specific knowledge, while keeping the number of models low to achieve fast segmentation. Experimental results show that the new segmenter outperforms our previous segmenter significantly, particularly in reducing deletion errors.
