Interspeech 2007 Session ThC.P1a: Phonetic segmentation and classification II
Thursday, August 30, 2007
13:30 – 15:30
Frank Soong (Microsoft Research Asia)
Dual-Channel Acoustic Detection of Nasalization States
Xiaochuan Niu, Center for Spoken Language Understanding, OGI School of Science & Engineering at OHSU
Jan P. H. van Santen, Center for Spoken Language Understanding, OGI School of Science & Engineering at OHSU
Automatic detection of different oral-nasal configurations during speech is useful for understanding normal nasalization and for assessing certain speech disorders. We propose an algorithm that extracts nasalization features from dual-channel acoustic signals acquired with a simple two-microphone setup. The feature is based on a dual-channel acoustic model and its associated analysis method. We test this feature in both speaker-dependent and speaker-independent tasks, comparing it with the conventional single-channel MFCC feature; the proposed feature performs better in both tasks.
Acoustic Parameters for the Automatic Detection of Vowel Nasalization
Tarun Pruthi, Institute of Systems Research and Dept. of Electrical and Computer Engg., University of Maryland, College Park, MD 20742, USA
Carol Espy-Wilson, Institute of Systems Research and Dept. of Electrical and Computer Engg., University of Maryland, College Park, MD 20742, USA
The aim of this work was to propose Acoustic Parameters (APs) for the automatic detection of vowel nasalization, based on prior knowledge of the acoustics of nasalized vowels. Nine automatically extractable APs were proposed to capture the most important acoustic correlates of vowel nasalization (extra pole-zero pairs, F1 amplitude reduction, F1 bandwidth increase and spectral flattening). The performance of these APs was tested on several databases with different sampling rates and recording conditions. Accuracies of 96.28%, 77.90% and 69.58% were obtained with these APs on the StoryDB, TIMIT and WS96/97 databases, respectively, in a Support Vector Machine classifier framework. To our knowledge, these are the best results reported on this task.
On the Use of Time-Delay Neural Networks for Highly Accurate Classification of Stop Consonants
Jun Hou, Rutgers, the State University of New Jersey
Lawrence Rabiner, Rutgers, the State University of New Jersey
Sorin Dusan, Rutgers, the State University of New Jersey
Time-Delay Neural Networks (TDNNs) have been shown by Waibel et al. to be an effective method for classifying dynamic speech sounds such as voiced stop consonants. In this paper we discuss key issues in the design and training of a TDNN, based on a Multi-Layer Perceptron (MLP), for classifying the voiced stop consonants (/b/, /d/ and /g/) and the unvoiced stop consonants (/p/, /t/ and /k/) from the TIMIT database. We show that transforming each input parameter to the TDNN to a zero-mean, unit-variance distribution (separately for each phoneme class) greatly improves overall classification performance. The resulting TDNN classification accuracy for voiced or unvoiced stop consonants is around 91%. This performance is achieved without any class-specific discriminative spectral measurements, so the method can be applied directly to the classification of any dynamic phoneme class.
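The per-class input normalisation described in the abstract can be sketched as follows. This is an illustrative, minimal version only: the statistics would in practice be estimated on the TDNN training frames of each phoneme class, and all names here are hypothetical.

```python
# Per-class zero-mean, unit-variance normalisation of input features.
# Illustrative sketch; real statistics would come from TDNN training data.

def class_stats(frames):
    """Estimate per-dimension mean and standard deviation for one class."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    std = [v ** 0.5 or 1.0 for v in var]  # guard against zero variance
    return mean, std

def normalise(frame, mean, std):
    """Map one feature vector to zero mean and unit variance."""
    return [(x - m) / s for x, m, s in zip(frame, mean, std)]

# Toy example: two 2-D feature vectors from one (hypothetical) class.
frames = [[1.0, 10.0], [3.0, 14.0]]
mean, std = class_stats(frames)
normed = [normalise(f, mean, std) for f in frames]
```

At runtime the class identity of a test frame is unknown, so a scheme like this is typically applied per candidate class, which is consistent with the class-wise transformation the abstract describes.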
A New Approach for Phoneme Segmentation of Speech Signals
Ladan Golipour, INRS-EMT, Quebec University, Montreal, Canada
Douglas O'Shaughnessy, INRS-EMT, Quebec University, Montreal, Canada
In this paper, we present a new method for segmenting speech at the phoneme level. For this purpose, we use the short-time Fourier transform of the speech signal. The goal is to recognize the locations of the main energy changes in frequency over time, which can be interpreted as phoneme boundaries. We also apply a sub-band analysis, searching for energy changes in individual bands to obtain further precision. Moreover, we employ the modified group-delay function to achieve a clearer representation of the boundary locations and to smooth out undesired fluctuations in the signal. We also study the use of an auditory spectrogram instead of a regular spectrogram in the segmentation process. The method was tested on the phonetically diverse part of the TIMIT database, and the results show that 87% of the boundaries are successfully recognized.
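The core idea, marking boundaries where per-band energy changes sharply over time, can be sketched on precomputed short-time band energies. This is a simplified stand-in, not the paper's method: the STFT framing, sub-band filtering and group-delay smoothing are not reproduced, and the threshold is arbitrary.

```python
import math

# Boundary candidates from summed absolute log-energy changes across bands.
# Sketch only; band energies would come from an STFT-based sub-band analysis.

def boundaries(band_energies, threshold=1.0):
    """Return frame indices where the summed absolute log-energy change
    across all bands exceeds `threshold`."""
    n_frames = len(band_energies[0])
    marks = []
    for t in range(1, n_frames):
        change = sum(abs(math.log(band[t]) - math.log(band[t - 1]))
                     for band in band_energies)
        if change > threshold:
            marks.append(t)
    return marks

# Toy example: two bands; the energy distribution shifts at frame 2.
bands = [[1.0, 1.0, 8.0, 8.0],
         [2.0, 2.0, 0.5, 0.5]]
cands = boundaries(bands, threshold=1.0)
```

Summing changes over individual bands, rather than over total energy alone, is what lets a boundary register even when energy merely moves between frequency regions.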
Automatically Learning the Units of Speech by Non-negative Matrix Factorisation
Veronique Stouten, Katholieke Universiteit Leuven - Dept. ESAT
Kris Demuynck, Katholieke Universiteit Leuven - Dept. ESAT
Hugo Van hamme, Katholieke Universiteit Leuven - Dept. ESAT
We present an unsupervised technique for discovering the (word-sized) speech units into which a corpus of utterances can be decomposed. First, a fixed-length high-dimensional vector representation of each utterance is obtained. Then the resulting matrix is decomposed into additive units by applying the non-negative matrix factorisation algorithm. On a small-vocabulary task, each of the obtained basis vectors represents one of the uttered words. We also investigate how much speech data is needed to obtain a correct set of basis vectors: by decreasing the number of occurrences of the words in the corpus, we obtain an indication of the learning rate of the system.
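The decomposition step can be illustrated with a minimal multiplicative-update NMF on a toy matrix whose columns play the role of utterance vectors. This is a generic NMF sketch, not the paper's system: the fixed-length utterance representation is not reproduced, and the toy data is invented.

```python
import numpy as np

# Minimal multiplicative-update NMF: factor a non-negative matrix V
# (one column per utterance vector) as V ~ W H with W, H >= 0.

def nmf(V, r, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + 1e-3
    H = rng.random((r, n)) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)   # update basis vectors
    return W, H

# Toy corpus: four "utterances" built additively from two parts,
# standing in for word-sized units.
parts = np.array([[1.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 1.0]])
codes = np.array([[1.0, 0.0, 1.0, 2.0],
                  [0.0, 1.0, 1.0, 0.0]])
V = parts @ codes
W, H = nmf(V, r=2)
err = np.linalg.norm(V - W @ H)
```

Because the updates are multiplicative, non-negativity of W and H is preserved throughout, which is what makes the learned basis vectors interpretable as additive units.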
A Saliency-Based Auditory Attention Model with Applications to Unsupervised Prominent Syllable Detection in Speech
Ozlem Kalinli, Speech Analysis and Interpretation Laboratory (SAIL), Dept. of Electrical Engineering-Systems, University of Southern California (USC)
Shrikanth Narayanan, Speech Analysis and Interpretation Laboratory (SAIL), Dept. of Electrical Engineering-Systems, University of Southern California (USC)
Bottom-up, saliency-driven attention allows the brain to detect nonspecific conspicuous targets in cluttered scenes before fully processing and recognizing them. Here, a novel biologically plausible auditory saliency map is presented to model such saliency-based auditory attention. Multi-scale auditory features are extracted based on the processing stages in the central auditory system and combined into a single master saliency map. The usefulness of the proposed auditory saliency map for detecting prominent syllable and word locations in speech is tested in an unsupervised manner. When evaluated on broadcast news-style read speech from the BU Radio News Corpus, the model achieves 75.9% accuracy at the syllable level and 78.1% at the word level. These results compare well with reported human performance.
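The map-combination step can be sketched as follows: several feature curves over time are each normalised and summed into one master saliency curve, whose local peaks mark candidate prominent locations. This is a heavily simplified stand-in; the paper's multi-scale auditory feature extraction is not reproduced, and the two feature curves below are invented.

```python
# Combine normalised feature maps into a master saliency curve and
# pick its local peaks as candidate prominent syllables. Sketch only.

def normalise(m):
    """Rescale a feature map to the [0, 1] range."""
    lo, hi = min(m), max(m)
    span = (hi - lo) or 1.0
    return [(x - lo) / span for x in m]

def master_saliency(feature_maps):
    maps = [normalise(m) for m in feature_maps]
    return [sum(vals) for vals in zip(*maps)]

def peaks(curve):
    """Indices of strict local maxima."""
    return [t for t in range(1, len(curve) - 1)
            if curve[t] > curve[t - 1] and curve[t] > curve[t + 1]]

# Toy feature curves over six analysis frames (hypothetical values).
intensity = [0.1, 0.9, 0.2, 0.3, 0.8, 0.1]
contrast  = [0.0, 0.7, 0.1, 0.2, 0.9, 0.0]
sal = master_saliency([intensity, contrast])
prominent = peaks(sal)
```

Normalising each map before summation keeps one feature from dominating the master map, mirroring the role of normalisation in saliency-map models generally.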
Zero-Crossing-Based Ratio Masking for Sound Segregation
Sung Jun An, KAIST
Young-Ik Kim, KAIST
Rhee Man Kil, KAIST
This paper presents a new method of zero-crossing-based binaural mask estimation for sound segregation when multiple sound sources are present simultaneously. The mask is determined from the estimated sound source directions, using spatial cues such as inter-aural time differences (ITDs) and inter-aural intensity differences (IIDs). In the proposed method, ITDs are estimated using the statistical properties of zero-crossings detected in binaural filter-bank outputs. For the masking itself, we use the target-to-total power ratio in each segment of the time-frequency domain, and we show that this power ratio is optimal from the viewpoint of reconstructing the target speech signal. As a result, the proposed method provides an accurate estimate of sound source directions as well as a good masking scheme for speech segregation, at significantly lower computational complexity than cross-correlation-based methods.
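The masking quantity named in the abstract, the target-to-total power ratio per time-frequency segment, can be sketched directly. This illustrates the ratio mask alone; the zero-crossing ITD estimation that supplies the per-cell target and interference powers is not reproduced, and the toy values are invented.

```python
# Target-to-total power-ratio mask over time-frequency cells.
# Sketch only; real powers would come from direction-based grouping.

def ratio_mask(target_power, interference_power):
    """Per-cell mask: target power divided by total power."""
    mask = []
    for p_t, p_i in zip(target_power, interference_power):
        total = p_t + p_i
        mask.append(p_t / total if total > 0 else 0.0)
    return mask

def apply_mask(mixture, mask):
    """Weight each mixture cell by its mask value."""
    return [m * g for m, g in zip(mixture, mask)]

# Toy example: four T-F cells in one frequency channel.
p_target = [4.0, 1.0, 0.0, 9.0]
p_interf = [0.0, 3.0, 2.0, 3.0]
mask = ratio_mask(p_target, p_interf)
```

Unlike a binary mask, a ratio mask of this form retains partial target energy in cells where target and interference overlap, which is the sense in which it can be optimal for reconstructing the target signal.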
Event Detection of Speech Signals Based on Auditory Processing with a Dynamic Compressive Gammachirp Filterbank
Satomi Tanaka, Kyoto City University of Arts
Minoru Tsuzaki, Kyoto City University of Arts, NICT/ATR Spoken Language Communication Research Labs.
Hiroaki Kato, ATR Cognitive Information Science Labs./NICT
Yoshinori Sagisaka, GITI, Waseda University
To simulate the perceptual extraction of temporal structures of speech, the authors have proposed an event-plausibility model that detects the occurrence of subevents in continuous speech signals based on auditory processing. One of its core components is the filterbank module, which simulates the mechanical frequency analysis performed by the basilar membrane in the cochlea. In this paper, the output of a new model using a dynamic compressive gammachirp (dcGC) auditory filterbank is compared with that of the previous model, which used a gammatone auditory filterbank. The most important difference between these filters is the nonlinear dynamic level-dependence of the new filter; the previous filterbank was linear. Simulation results revealed no significant advantage of the new filter (dcGC) for event detection by the event-plausibility model, which suggests that the event-plausibility algorithm is robust against differences in peripheral auditory processing.
Segmentation of speech: Child’s play?
Odette Scharenborg, Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands
Mirjam Ernestus, Department of Linguistics, Radboud University Nijmegen, The Netherlands; Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
Vincent Wan, Speech and Hearing Research Group, Department of Computer Science, University of Sheffield, UK
The difficulty of segmenting a speech signal into words is immediately clear when listening to a foreign language: since the words of the language are unknown, it suddenly becomes much harder to segment the signal. Infants face the same task when learning their first language, and this study provides a better understanding of it. We applied an automatic algorithm to the task of speech segmentation without prior knowledge of the phoneme labels of the language. An analysis of the boundaries erroneously placed inside a phoneme showed that the algorithm consistently placed boundaries within ‘dynamic’ phonemes (in which an acoustic change occurs, e.g., plosives, which consist of a closure and a release part), dividing the phoneme into two (or more) segments. A question for further research is whether infants also have to overcome this difficulty in learning to group together dynamic phonemes.
Dimensionality Reduction Methods Applied to both Magnitude and Phase Derived Features
Andrew Errity, School of Computing, Dublin City University, Dublin 9, Ireland
John McKenna, School of Computing, Dublin City University, Dublin 9, Ireland
Barry Kirkpatrick, School of Computing, Dublin City University, Dublin 9, Ireland
A number of previous studies have shown that speech sounds may have an intrinsic low-dimensional structure. Such studies have focused on magnitude-based features, ignoring phase information, as is the convention in many speech processing applications. In this paper, dimensionality reduction methods are applied to MFCC and modified group delay function (MODGDF) features, derived from the magnitude and phase spectrum, respectively. The low-dimensional structure of these representations is examined and a method for combining them is detailed. Results show that both magnitude- and phase-derived features have a low-dimensional structure. MFCCs are found to offer higher accuracy than MODGDFs in phone classification tasks, and combining MFCCs and MODGDFs gives further improvements. PCA is shown to combine MFCCs and MODGDFs efficiently, improving classification accuracy without large increases in feature dimensionality.
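The combination strategy, concatenating the two feature streams and projecting onto leading principal axes, can be sketched generically. The feature matrices below are random stand-ins for real MFCC and MODGDF vectors, and the dimensionalities are illustrative, not the paper's.

```python
import numpy as np

# Combine two feature streams by concatenation followed by PCA.
# Generic sketch; real MFCC/MODGDF vectors would replace the random data.

def pca_combine(mfcc, modgdf, n_components):
    X = np.hstack([mfcc, modgdf])          # concatenate per-frame features
    X = X - X.mean(axis=0)                 # centre each dimension
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :n_components]  # leading principal axes
    return X @ top

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 13))      # 100 frames, 13 magnitude features
modgdf = rng.standard_normal((100, 13))    # 100 frames, 13 phase features
combined = pca_combine(mfcc, modgdf, n_components=10)
```

Projecting the concatenated vectors rather than keeping both streams whole is what keeps the combined representation compact, matching the abstract's point about avoiding large increases in feature dimensionality.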