Interspeech 2007 Session FrB.O2: Articulatory features
Friday, August 31, 2007
10:00 – 12:00
Karen Livescu (MIT)
A multitask learning perspective on acoustic-articulatory inversion
Korin Richmond, CSTR, Edinburgh University
This paper proposes that by viewing an inversion mapping MLP from a multitask learning perspective, we may be able to relax two constraints inherent in using electromagnetic articulography as a source of articulatory information for speech technology purposes. As a first step towards evaluating this idea, we perform an inversion mapping experiment to ascertain whether the hidden layer of a "multitask" MLP can act beneficially as a representation shared between inversion mapping subtasks for multiple articulatory targets. Our results for the tongue dorsum x-coordinate indicate this is indeed the case and show good promise. Results for the tongue dorsum y-coordinate, however, are less clear-cut and require further investigation.
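The shared-representation idea behind the multitask MLP can be sketched as follows. This is a minimal illustration, not the paper's network: all weights are random placeholders, and the layer sizes and target names are assumptions.

```python
import math
import random

random.seed(0)

def mlp_forward(x, W_hidden, heads):
    """Multitask MLP forward pass: one shared hidden layer, one linear
    output head per articulatory target. Weights here are illustrative."""
    # shared hidden representation (tanh units), computed once per frame
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]
    # each subtask reads the *same* hidden layer through its own head
    return {name: sum(w * hi for w, hi in zip(head, h))
            for name, head in heads.items()}

n_in, n_hidden = 13, 8  # assumed sizes, not those used in the paper
W_hidden = [[random.uniform(-1, 1) for _ in range(n_in)]
            for _ in range(n_hidden)]
heads = {name: [random.uniform(-1, 1) for _ in range(n_hidden)]
         for name in ("tongue_dorsum_x", "tongue_dorsum_y")}

frame = [random.uniform(-1, 1) for _ in range(n_in)]  # stand-in acoustic frame
outputs = mlp_forward(frame, W_hidden, heads)
```

Training both heads jointly forces the hidden layer to encode information useful to every articulatory subtask, which is the multitask benefit the paper investigates.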
A comparison of acoustic features for articulatory inversion
Chao Qin, Dept. CSEE, OGI, OHSU
Miguel Carreira-Perpinan, Dept. CSEE, OGI, OHSU
We study empirically the best acoustic parameterization for the acoustic-to-articulatory mapping, the problem of recovering the sequence of vocal tract shapes that produced a given acoustic speech signal. We compare all combinations of the following acoustic parameterizations: 1) popular acoustic features such as MFCC and PLP, with and without dynamic features; 2) different short-time window lengths; 3) different levels of smoothing of the acoustic temporal trajectories. We show that long window lengths and smoothing help to alleviate the jaggedness of the acoustic features. Experimental results on a real speech production database show that LSFs with dynamic features, a 64 ms window length, and second-level smoothing perform best among all combinations. We further found a 15 ms time delay between acoustic and articulatory frames.
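Two of the ingredients compared above can be sketched in a few lines: regression-based dynamic (delta) features and moving-average smoothing of a feature trajectory. This is a generic illustration of both operations, not the paper's exact configuration; the window sizes are assumptions.

```python
def deltas(traj, k=2):
    """Regression-based delta features over a +/-k frame window,
    with edge frames replicated (standard practice; k=2 is an assumption)."""
    denom = 2 * sum(i * i for i in range(1, k + 1))
    padded = [traj[0]] * k + list(traj) + [traj[-1]] * k
    return [sum(i * (padded[t + k + i] - padded[t + k - i])
                for i in range(1, k + 1)) / denom
            for t in range(len(traj))]

def smooth(traj, w=3):
    """Moving-average smoothing to reduce the jaggedness of an
    acoustic feature trajectory (window width w is illustrative)."""
    half = w // 2
    padded = [traj[0]] * half + list(traj) + [traj[-1]] * half
    return [sum(padded[t:t + w]) / w for t in range(len(traj))]
```

For example, `smooth([0, 0, 3, 0, 0])` spreads the isolated spike across its neighbours, and `deltas` of a constant trajectory is zero everywhere.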
Can unquantised articulatory feature continuums be modelled?
Odette Scharenborg, Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands
Vincent Wan, Speech and Hearing Research Group, Department of Computer Science, University of Sheffield, UK
Articulatory feature (AF) modelling of speech has received considerable attention in automatic speech recognition research. Although termed ‘articulatory’, previous definitions make certain assumptions that are invalid, for instance that articulators ‘hop’ from one fixed position to the next. In this paper, we study two methods, based on support vector classification (SVC) and regression (SVR), in which the articulation continuum is modelled without being restricted to discrete AF value classes. A comparison with a baseline system trained on quantised values of the articulation continuum shows that both SVC and SVR outperform the baseline for two of the three investigated AFs, with improvements of up to 5.6% absolute.
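The contrast between the quantised baseline and the continuous approach can be made concrete with a toy example. Everything below is hypothetical: the class centres and the mapping of an AF continuum onto [0, 1] are illustrative stand-ins, not the paper's feature set.

```python
def quantise(value, centres):
    """Baseline-style treatment: snap a continuous AF value to the
    nearest discrete class centre, discarding intermediate positions."""
    return min(centres, key=lambda c: abs(c - value))

# hypothetical AF continuum (e.g. a tongue position normalised to [0, 1])
# with three assumed class centres
centres = [0.0, 0.5, 1.0]
continuum = [0.10, 0.30, 0.55, 0.80]

quantised = [quantise(v, centres) for v in continuum]
# a regression model (SVR in the paper) would instead predict the raw
# continuum values, so transitional articulator positions are preserved
```

The quantised trajectory jumps between fixed classes, which is exactly the ‘hopping’ assumption the paper argues against; regression targets keep the transitions.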
Estimation of Place of Articulation in Stop Consonants for Visual Feedback
Milind S. Shah, Department of Electrical Engineering, Indian Institute of Technology Bombay, Mumbai, India
Prem C. Pandey, Department of Electrical Engineering, Indian Institute of Technology Bombay, Mumbai, India
Speech-training systems providing visual feedback of vocal tract shape have been found useful for improving vowel articulation. Estimation of vocal tract shape based on LPC and other analysis techniques generally fails during stop closures, due to very low signal energy and the unavailability of spectral information. Based on estimated area values and line spectrum pair (LSP) coefficients before and after the stop closure in vowel-consonant-vowel (VCV) syllables, least-squares bivariate conic and cubic polynomial surfaces were generated and used to estimate the shape during the stop closure by 2D interpolation. We conclude that conic-surface-based modeling of either area values or LSP coefficients, for VCV syllables of the type /aCa/, with 2D interpolation during the stop closure, can consistently estimate bilabial, alveolar, and velar places of constriction.
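A bivariate conic surface of the kind fitted above can be sketched as follows. The coefficients and the interpretation of the two variables (time across the closure, position along the vocal tract) are schematic assumptions for illustration, not values from the paper.

```python
def conic_surface(c, t, s):
    """Evaluate a bivariate conic surface
        z = c0 + c1*t + c2*s + c3*t^2 + c4*t*s + c5*s^2,
    a stand-in for the paper's least-squares-fitted surfaces, where t is
    time relative to the stop closure and s is position along the tract."""
    return (c[0] + c[1] * t + c[2] * s
            + c[3] * t * t + c[4] * t * s + c[5] * s * s)

# assumed coefficients, for illustration only
coeffs = [1.0, -0.2, 0.1, 0.05, 0.0, -0.01]

# estimate the vocal tract shape at a time inside the closure (t = 0.5)
# by evaluating the fitted surface at each position along the tract
closure_shape = [conic_surface(coeffs, 0.5, s / 10) for s in range(11)]
```

Fitting such a surface to the area values (or LSP coefficients) observed before and after the closure, and then evaluating it at closure-interior times, is the 2D-interpolation idea the abstract describes.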
Compact representations of the articulatory-to-acoustic mapping
Blaise Potard, LORIA
Yves Laprie, LORIA
Articulatory codebooks are very often used to represent the articulatory-to-acoustic mapping. They thus need to be compact while offering very good acoustic precision. This paper presents a method of articulatory codebook construction more general than that of Ouni [Ouni2005], in the sense that the articulatory-to-acoustic mapping is approximated by multivariable polynomials. The second major contribution concerns the subdivision process, which finds the most efficient subdivision, i.e. the one that minimizes the size of the codebook while guaranteeing very good acoustic precision. Experiments show that the size of the codebook can be reduced by a factor of 20 while the acoustic precision is simultaneously improved by a factor of 2, using second-order polynomials together with this new construction strategy.
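The subdivide-until-accurate idea can be illustrated with a one-dimensional analogue: split an interval recursively until a quadratic polynomial approximates the mapping within a tolerance. This is a sketch of the general strategy only; the paper works in a multidimensional articulatory space with its own subdivision and fitting criteria, and `math.sin` here is just a toy stand-in for the mapping.

```python
import math

def build_codebook(f, lo, hi, tol):
    """Recursively subdivide [lo, hi] until the quadratic through each
    cell's endpoints and midpoint approximates f to within tol -- a 1-D
    analogue of polynomial-approximated codebook cells."""
    mid = (lo + hi) / 2

    def quad(x):  # Lagrange quadratic through (lo, mid, hi)
        return (f(lo) * (x - mid) * (x - hi) / ((lo - mid) * (lo - hi))
                + f(mid) * (x - lo) * (x - hi) / ((mid - lo) * (mid - hi))
                + f(hi) * (x - lo) * (x - mid) / ((hi - lo) * (hi - mid)))

    # estimate the cell's approximation error at the quarter points
    q1, q3 = lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)
    err = max(abs(quad(x) - f(x)) for x in (q1, q3))
    if err <= tol or hi - lo < 1e-3:
        return [(lo, hi, quad)]  # cell is accurate (or small) enough: keep it
    return build_codebook(f, lo, mid, tol) + build_codebook(f, mid, hi, tol)

# toy "mapping": one acoustic quantity as a function of one articulatory parameter
cells = build_codebook(math.sin, 0.0, math.pi, tol=1e-3)
```

Higher-order polynomials per cell mean fewer cells are needed for the same acoustic precision, which is the source of the codebook-size reduction the abstract reports.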
Articulatory Feature Classifiers Trained on 2000 hours of Telephone Speech
Joe Frankel, University of Edinburgh
Mathew Magimai-Doss, ICSI
Simon King, University of Edinburgh
Karen Livescu, MIT
Ozgur Cetin, ICSI
This paper is intended to advertise the public availability of the articulatory feature (AF) classification multi-layer perceptrons (MLPs) which were used in the Johns Hopkins 2006 summer workshop. We describe the design choices, data preparation, AF label generation, and the training of MLPs for feature classification on close to 2000 hours of telephone speech. In addition, we present some analysis of the MLPs in terms of classification accuracy and confusions along with a brief summary of the results obtained during the workshop using the MLPs. We invite interested parties to make use of these MLPs.
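Frame-level AF classification with such MLPs amounts to taking the feature value with the highest softmax posterior at each frame. The sketch below illustrates only that decision step; the feature value set shown is a hypothetical example, not the workshop's actual AF label inventory.

```python
import math

def softmax(zs):
    """Numerically stable softmax over the MLP's output activations."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

# hypothetical value set for one AF (place of articulation); illustrative only
PLACE = ["labial", "alveolar", "velar", "silence"]

def classify_frame(logits):
    """Pick the AF value with the highest posterior for one frame."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return PLACE[best], probs[best]
```

Comparing these per-frame decisions against the generated AF labels gives the classification accuracies and confusion patterns the abstract refers to.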