Interspeech 2007
August 27-31, 2007

Antwerp, Belgium

Interspeech 2007 Session TuD.P3b: Spectral analysis, formants and vocal tract models

Type poster
Date Tuesday, August 28, 2007
Time 16:00 – 18:00
Room Keurvels
Chair Yegnanarayana Bayya (International Institute of Information Technology Hyderabad)


Linear prediction of audio signals
Toon van Waterschoot, Dept. of Electrical Engineering (ESAT-SCD), Katholieke Universiteit Leuven, Leuven, Belgium
Marc Moonen, Dept. of Electrical Engineering (ESAT-SCD), Katholieke Universiteit Leuven, Leuven, Belgium

Linear prediction (LP) is a valuable tool for speech analysis and coding, due to the efficiency of the autoregressive model for speech signals. In audio analysis and coding, the sinusoidal model is much more popular, which is partly due to the poor performance of audio LP. By examining audio LP from a spectral estimation point of view, we observe that the distribution of the audio signal's dominant frequencies in the Nyquist interval is a critical factor determining LP performance. In this framework, we describe five existing alternative LP methods and illustrate how they all attempt to solve the observed frequency distribution problem.
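The autoregressive modelling underlying LP can be sketched with the classical autocorrelation method and Levinson-Durbin recursion. This is a generic illustration of conventional LP, not the authors' code; the AR(1) toy signal is an assumption made for the example.

```python
import numpy as np

def lp_coefficients(x, order):
    """Autocorrelation-method linear prediction via the Levinson-Durbin
    recursion; returns A(z) = 1 + a1 z^-1 + ... and the residual energy."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update lower-order coefficients
        a[i] = k
        err *= 1.0 - k * k                   # residual energy shrinks
    return a, err

# toy AR(1) signal x[n] = 0.9 x[n-1] + e[n]; a[1] should approach -0.9
rng = np.random.default_rng(0)
e = rng.standard_normal(4096)
x = np.zeros(4096)
for n in range(1, 4096):
    x[n] = 0.9 * x[n - 1] + e[n]
a, err = lp_coefficients(x, 1)
```

For audio signals whose dominant frequencies cluster in a narrow part of the Nyquist interval, the normal equations solved here become ill-conditioned, which is one way to see the frequency distribution problem the abstract refers to.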

Stabilised Weighted Linear Prediction - A Robust All-Pole Method for Speech Processing
Carlo Magi, Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, Finland
Tom Bäckström, Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, Finland
Paavo Alku, Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, Finland

Weighted linear prediction (WLP) is a method for computing all-pole models of speech by applying temporal weighting to the residual energy. By using the short-time energy (STE) as a weighting function, the algorithm over-weights those samples that fit the underlying speech production model well. The current work introduces a modified WLP method, stabilised weighted linear prediction (SWLP), which always leads to stable all-pole models and whose performance can be adjusted by changing the length (denoted by M) of the STE window. With a large value of M, the SWLP spectra become similar to conventional LP spectra, while a small value of M results in SWLP filters similar to those computed by the minimum variance distortionless response (MVDR) method. The study compares the performance of SWLP, MVDR, and conventional LP in spectral modelling of speech sounds corrupted by additive white Gaussian noise. The results indicate that SWLP is the method most robust to noise, especially with a small value of M.
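The core WLP idea, minimising an STE-weighted residual energy, can be sketched as a weighted least-squares problem. This shows plain WLP only, without the SWLP stabilisation step; the window length and toy signal are assumptions for the example.

```python
import numpy as np

def wlp(x, order, M):
    """Weighted linear prediction: minimise the STE-weighted residual
    energy sum_n w[n] * (x[n] - sum_k a_k x[n-k])^2."""
    N = len(x)
    # short-time energy of the M preceding samples as the weight
    w = np.array([np.sum(x[max(0, n - M):n] ** 2) for n in range(N)])
    # data matrix of delayed samples x[n-1] ... x[n-order]
    X = np.column_stack([np.concatenate([np.zeros(k), x[:N - k]])
                         for k in range(1, order + 1)])
    R = X.T @ (w[:, None] * X)        # weighted normal equations
    r = X.T @ (w * x)
    return np.linalg.solve(R, r)      # a_k with x_hat[n] = sum_k a_k x[n-k]

# toy AR(1) signal: the single predictor coefficient should be near 0.9
rng = np.random.default_rng(1)
x = np.zeros(2048)
for n in range(1, 2048):
    x[n] = 0.9 * x[n - 1] + rng.standard_normal()
a = wlp(x, 1, 20)
```

Unlike the autocorrelation method, this weighted formulation does not guarantee a stable all-pole filter, which is precisely the issue SWLP addresses.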

Conditionally Linear Gaussian Models for Estimating Vocal Tract Resonances
Daniel Rudoy, Harvard University
Daniel N. Spendley, Harvard University
Patrick J. Wolfe, Harvard University

Vocal tract resonances play a central role in the perception and analysis of speech. Here we consider the canonical task of estimating such resonances from an observed acoustic waveform, and formulate it as a statistical model-based tracking problem. In this vein, Deng and colleagues recently showed that a robust linearization of the formant-to-cepstrum map enables the effective use of a Kalman filtering framework. We extend this model both to account for the uncertainty of speech presence by way of a censored likelihood formulation and to explicitly model formant cross-correlation via a vector autoregression, and in doing so retain a conditionally linear and Gaussian framework amenable to efficient estimation schemes. We provide evaluations using a recently introduced public database of formant trajectories, for which results indicate improvements of 20% to over 30% per formant in terms of root mean square error, relative to a contemporary benchmark formant analysis tool.
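The Kalman filtering framework the abstract builds on can be illustrated in miniature with a scalar filter tracking a single noisy frequency track. This is a generic stand-in, not the paper's conditionally linear Gaussian model; the random-walk dynamics and the variances q and r are illustrative assumptions.

```python
import numpy as np

def kalman_smooth_track(obs, q=1.0, r=4.0, x0=0.0, p0=1e4):
    """Minimal scalar Kalman filter with random-walk state dynamics,
    smoothing a noisy formant-frequency observation sequence."""
    x, p = x0, p0
    track = []
    for z in obs:
        p = p + q                    # predict under the random walk
        k = p / (p + r)              # Kalman gain
        x = x + k * (z - x)          # measurement update
        p = (1.0 - k) * p            # posterior variance
        track.append(x)
    return np.array(track)

# a constant 500 Hz observation: the state estimate converges to 500
track = kalman_smooth_track([500.0] * 50, x0=400.0)
```

The paper's extensions replace this scalar state with a vector autoregression over formants and censor the likelihood when speech is absent, while keeping each conditional update linear and Gaussian.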

Time-Varying Pre-emphasis and Inverse Filtering of Speech
Karl Schnell, Institute of Applied Physics, Goethe-University Frankfurt, Max-von-Laue-Str. 1, D-60438 Frankfurt am Main, Germany
Arild Lacroix, Institute of Applied Physics, Goethe-University Frankfurt, Max-von-Laue-Str. 1, D-60438 Frankfurt am Main, Germany

In this contribution, a time-varying linear prediction method is applied to speech processing. In contrast to the commonly used linear prediction approach, the proposed time-varying method considers the continuous time evolution of the vocal tract and, additionally, avoids block-wise processing. On the assumption that the linear predictor coefficients evolve linearly within sections and continuously over the whole signal, the optimum time-varying coefficients can be determined quasi-analytically by a least-squares approach. The investigations show that the method is well suited to realizing a time-varying pre-emphasis. Furthermore, the results show that the method is suitable for time-varying inverse filtering.
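A first-order pre-emphasis with a piecewise-linearly evolving coefficient, matching the abstract's assumption of linear-in-sections coefficient evolution, can be sketched as follows. The section length and knot values are assumptions for the example, and this only applies a given coefficient trajectory rather than estimating the optimal one.

```python
import numpy as np

def tv_preemphasis(x, alphas, section_len):
    """First-order pre-emphasis y[n] = x[n] - alpha[n] * x[n-1] with a
    coefficient that evolves piecewise-linearly between the section
    boundary values given in `alphas`."""
    knots = np.arange(len(alphas)) * section_len
    alpha = np.interp(np.arange(len(x)), knots, alphas)  # per-sample coefficient
    y = np.empty(len(x))
    y[0] = x[0]
    y[1:] = x[1:] - alpha[1:] * x[:-1]
    return y

# with constant knot values this reduces to ordinary fixed pre-emphasis
x = np.arange(6.0)
y = tv_preemphasis(x, [0.95, 0.95], 5)
```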

Reconstructing Audio Signals from Modified Non-Coherent Hilbert Envelopes
Joachim Thiemann, McGill University
Peter Kabal, McGill University

In this paper, we present a speech and audio analysis-synthesis method based on a Basilar Membrane (BM) model. The audio signal is represented in this method by the Hilbert envelopes of the responses to complex gammatone filters uniformly spaced on a critical band scale. We show that for speech and audio signals, a perceptually equivalent signal can be reconstructed from the envelopes alone by an iterative procedure that estimates the associated carrier for the envelopes. The rate requirement of the envelope information is reduced by low-pass filtering and sampling, and it is shown that it is possible to recover a signal without audible distortion from the sampled envelopes. This may lead to improved perceptual coding methods.
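The Hilbert envelope used for each filter channel can be computed from the FFT-based analytic signal. This is a standard construction, shown here on a raw cosine rather than on actual gammatone filter outputs.

```python
import numpy as np

def hilbert_envelope(x):
    """Hilbert envelope |x + j*H{x}|, computed by zeroing the negative
    frequencies of the FFT to form the analytic signal."""
    N = len(x)
    h = np.zeros(N)
    h[0] = 1.0
    if N % 2 == 0:
        h[N // 2] = 1.0          # Nyquist bin kept once
        h[1:N // 2] = 2.0        # positive frequencies doubled
    else:
        h[1:(N + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * h))

# the envelope of a pure cosine with an integer number of periods is 1
n = np.arange(256)
env = hilbert_envelope(np.cos(2 * np.pi * 8 * n / 256))
```

In the paper's setting the carrier discarded by this operation is what the iterative procedure must re-estimate at synthesis time.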

A Flexible Spectral Modification Method based on Temporal Decomposition and Gaussian Mixture Model
Binh Phu Nguyen, School of Information Science, Japan Advanced Institute of Science and Technology
Masato Akagi, School of Information Science, Japan Advanced Institute of Science and Technology

This paper presents a new spectral modification method to address two drawbacks of conventional spectral modification methods: insufficient smoothness of the modified spectra between frames, and ineffective spectral modification. To overcome the insufficient smoothness, a speech analysis technique called temporal decomposition (TD) is used to model the spectral evolution. Instead of modifying the speech spectra frame by frame, we only need to modify event targets and event functions, and the smoothness of the modified speech is ensured by the shape of the event functions. To overcome the ineffective spectral modification, we use Gaussian mixture model (GMM) parameters as the input of TD to model the spectral envelope, and develop a new method of modifying the GMM parameters in accordance with formant scaling factors. Experimental results verify the effectiveness of the proposed method in terms of both the smoothness of the modified speech and the effectiveness of the spectral modification.

A Comparison of Estimated and MAP-Predicted Formants and Fundamental Frequencies with a Speech Reconstruction Application
Jonathan Darch, University of East Anglia
Ben Milner, University of East Anglia

This work compares the accuracy of fundamental frequency and formant frequency estimation methods and maximum a posteriori (MAP) prediction from MFCC vectors with hand-corrected references. Five fundamental frequency estimation methods are compared to fundamental frequency prediction from MFCC vectors in both clean and noisy speech. Similarly, three formant frequency estimation and prediction methods are compared. An analysis of estimation and prediction accuracy shows that prediction from MFCCs provides the most accurate voicing classification across clean and noisy speech. On clean speech, fundamental frequency estimation outperforms prediction from MFCCs, but as noise increases the performance of prediction is significantly more robust than estimation. Formant frequency prediction is found to be more accurate than estimation in both clean and noisy speech. A subjective analysis of the estimation and prediction methods is also made by reconstructing speech from the acoustic features.

Effect of Incomplete Glottal Closures on Estimates of Glottal Waves via Inverse Filtering of Vowel Sounds
Huiqun Deng, INRS-EMT, University of Quebec
Douglas O'Shaughnessy, INRS-EMT, University of Quebec, Canada

Glottal waves obtained via inverse filtering of vowel sounds may contain residual vocal-tract resonances due to incomplete glottal closures. This paper investigates the effect of incomplete glottal closures on the estimates of glottal waves obtained via inverse filtering. It shows that such a residual resonance appears as stationary ripples superimposed on the derivatives of the original glottal wave over a whole glottal cycle. Knowing this, one can determine whether significant vocal-tract resonances remain in the obtained glottal waves. It also shows that, given an incomplete glottal closure, better estimates of glottal waves can be obtained from vowel sounds with large lip openings than from other sounds. The glottal waves obtained from /a/ produced by male and female subjects are presented. The obtained glottal waves exhibit transient positive derivatives during rapid vocal-fold collisions, which are explained by the air squeezed out by the colliding vocal folds and the air from the glottal chink.

Vocal Tract and Area Function Estimation with both Lip and Glottal Losses
Kaustubh Kalgaonkar, Center for Signal and Image Processing, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA 30332-0250
Mark Clements, Center for Signal and Image Processing, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA 30332-0250

Traditional algorithms simplify the lattice recursion for evaluating the PARCORs by localizing the loss in the vocal tract at one of its ends, the lips or the glottis. In this paper we present a framework for mapping the vocal tract (VT) transfer function to pseudo areas with no rigid constraints on the losses in the system, thereby allowing losses to be present at both the lips and the glottis. This method allows us to calculate the reflection coefficients at both the glottis (r_G) and the lips (r_{Lip}). The area functions obtained from these new PARCORs have better temporal (inter-frame) and spatial (intra-frame) predictability.
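The classical link between reflection coefficients and the lossless-tube area function can be sketched with the standard recursion. This is the textbook lossless case, not the paper's two-ended lossy framework; the sign convention and the unit lip-end area are assumed choices.

```python
import numpy as np

def areas_from_parcors(k, A_lips=1.0):
    """Lossless-tube area function from reflection (PARCOR) coefficients,
    via the recursion A_{i+1} = A_i * (1 - k_i) / (1 + k_i).
    One common sign convention; normalised to the lip-end area."""
    areas = [A_lips]
    for ki in k:
        areas.append(areas[-1] * (1.0 - ki) / (1.0 + ki))
    return np.array(areas)

# k = 0.5 implies the next section has one third of the current area;
# zero reflection coefficients leave the area unchanged
areas = areas_from_parcors([0.5])
```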

Detection of instants of glottal closure using characteristics of excitation source
Guruprasad S, Department of Computer Science and Engineering, Indian Institute of Technology Madras, India
Yegnanarayana B, International Institute of Information Technology Hyderabad, India
Sri Rama Murty K, Department of Computer Science and Engineering, Indian Institute of Technology Madras, India

In this paper, we propose a method for detection of glottal closure instants (GCI) in the voiced regions of speech signals. The method is based on periodicity of significant excitations of the vocal tract system. The key idea is the computation of coherent covariance sequence, which overcomes the effect of dynamic range of the excitation source signal, while preserving the locations of significant excitations. The Hilbert envelope of linear prediction residual is used as an estimate of the source of excitation of the vocal tract system. Performance of the proposed method is evaluated in terms of the deviation between true GCIs and hypothesized GCIs, using clean speech and degraded speech signals. The signal-to-noise ratio (SNR) of speech signals in the vicinity of GCIs has significant bearing on the performance of the proposed method. The proposed method is accurate and robust for detection of GCIs, even in the presence of degradations.
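The final peak-picking step implied by such a method, selecting strong, well-separated maxima of an excitation-strength envelope, can be sketched as follows. The threshold ratio and minimum spacing are illustrative assumptions, and this is not the coherent-covariance computation the paper proposes.

```python
import numpy as np

def pick_gci_candidates(env, thresh_ratio=0.3, min_gap=20):
    """Candidate glottal closure instants: local maxima of an
    excitation-strength envelope (e.g. the Hilbert envelope of the LP
    residual) above a relative threshold, with a minimum spacing."""
    thr = thresh_ratio * np.max(env)
    picked = []
    for i in range(1, len(env) - 1):
        # a local maximum that clears the threshold
        if env[i] >= thr and env[i] > env[i - 1] and env[i] >= env[i + 1]:
            # enforce a minimum gap to the previously accepted instant
            if not picked or i - picked[-1] >= min_gap:
                picked.append(i)
    return picked

# synthetic envelope: three clear excitation peaks over a low floor
env = np.full(200, 0.05)
env[[40, 100, 160]] = 1.0
gcis = pick_gci_candidates(env)
```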

A comparative evaluation of the Zeros of Z Transform representation for voice source estimation
Nicolas Sturmel, LIMSI-CNRS, France
Christophe d'Alessandro, LIMSI-CNRS
Boris Doval, LIMSI-CNRS

A new method for voice source estimation is evaluated and compared to Linear Prediction (LP) inverse filtering methods (autocorrelation LPC, covariance LPC and IAIF). The method is based on a causal/anticausal model of the voice source and the ZZT (Zeros of Z-Transform) representation for causal/anticausal signal separation. A database containing synthetic speech with various voice source settings and natural speech with acoustic and electro-glottographic signals was recorded. Formal evaluation of the source estimation methods is based on spectral distances. The results show that the ZZT causal/anticausal decomposition method outperforms LP in voice source estimation for both synthetic and natural signals. However, its computational load is much heavier (despite a very simple principle) and the method seems sensitive to noise and computation precision errors.
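The "very simple principle" of the ZZT representation can be sketched directly: root the polynomial formed by a windowed frame and split its zeros by magnitude. The toy two-zero frame is an assumption for the example; in practice rooting long frames is what makes the method computationally heavy and precision-sensitive.

```python
import numpy as np

def zzt_split(frame):
    """Zeros of the Z-Transform (ZZT) of a windowed frame, split into a
    causal set (inside the unit circle) and an anticausal set (outside),
    as used for causal/anticausal voice-source decomposition."""
    z = np.roots(frame)            # roots of x[0] z^{N-1} + ... + x[N-1]
    causal = z[np.abs(z) < 1.0]
    anticausal = z[np.abs(z) >= 1.0]
    return causal, anticausal

# a toy frame built from one zero at 0.5 and one at 2.0
frame = np.convolve([1.0, -0.5], [1.0, -2.0])   # -> [1, -2.5, 1]
causal, anticausal = zzt_split(frame)
```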
