Interspeech 2007
August 27-31, 2007

Antwerp, Belgium

Interspeech 2007 Session ThB.P2a: Topics in acoustic modeling


Type: poster
Date: Thursday, August 30, 2007
Time: 10:00 – 12:00
Room: Alpaerts
Chair: Hynek Hermansky (IDIAP Research Institute, Martigny)

ThB.P2a‑1

Comparison of HMM and DTW methods in automatic recognition of pathological phoneme pronunciation
Robert Wielgat, Department of Technology, Higher State Vocational School in Tarnów, Tarnów, Poland
Tomasz Zieliński, Department of Telecommunications, AGH University of Science and Technology, Krakow, Poland
Paweł Świętojański, Department of Technology, Higher State Vocational School in Tarnów, Tarnów, Poland
Piotr Żołądź, Department of Technology, Higher State Vocational School in Tarnów, Tarnów, Poland
Daniel Król, Department of Technology, Higher State Vocational School in Tarnów, Tarnów, Poland
Tomasz Woźniak, Division of Logopedics and Applied Linguistics, Maria Curie-Skłodowska University, Lublin, Poland
Stanisław Grabias, Division of Logopedics and Applied Linguistics, Maria Curie-Skłodowska University, Lublin, Poland

In this paper, the recently proposed Human Factor Cepstral Coefficients (HFCC) are applied to the automatic recognition of pathological phoneme pronunciation in the speech of impaired children, and the efficiency of this approach is compared with that of the standard Mel-Frequency Cepstral Coefficient (MFCC) feature vector. Both dynamic time warping (DTW), operating on whole words or on embedded phoneme patterns, and hidden Markov models (HMM) are used as classifiers. The results demonstrate the superiority of combining HFCC features with a modified phoneme-based DTW classifier.
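The DTW classification scheme described above can be illustrated with a minimal sketch: the classic quadratic-time recurrence below warps one cepstral feature sequence onto another, and a test pattern is assigned the class of its nearest reference. The HFCC extraction and the paper's phoneme-embedding modification are not reproduced here; Euclidean frame distance and nearest-template classification are illustrative assumptions.

```python
import numpy as np

def dtw_distance(ref, test):
    """Dynamic time warping distance between two feature sequences.

    ref, test: arrays of shape (n_frames, n_coeffs), e.g. HFCC or MFCC
    vectors. Returns the accumulated distance along the best warping path.
    """
    n, m = len(ref), len(test)
    # Local Euclidean distances between every pair of frames.
    local = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Symmetric step pattern: match, insertion, deletion.
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    return acc[n, m]

def classify(test, references):
    """Assign the label of the reference pattern nearest to `test`."""
    return min(references, key=lambda label: dtw_distance(references[label], test))
```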
ThB.P2a‑2

Unsupervised Training with Directed Manual Transcription for Recognising Mandarin Broadcast Audio
Kai Yu, Cambridge University Engineering Department
Mark Gales, Cambridge University Engineering Department
Philip Woodland, Cambridge University Engineering Department

The performance of unsupervised discriminative training has been found to be highly dependent on the accuracy of the initial automatic transcription. This paper examines a strategy in which a relatively small amount of poorly recognised data is manually transcribed to supplement the automatically transcribed data. Experiments were carried out on a Mandarin broadcast transcription task using both Broadcast News (BN) and Broadcast Conversation (BC) data. A range of experimental conditions are compared for both maximum likelihood and discriminative training with directed manual transcription. A standard unsupervised discriminative training approach obtains only 17% of the reduction in character error rate (CER) achieved by supervised training, whereas automatically selecting 18% of the data for manual transcription yields 50% of the supervised CER gain. The directed approach to selecting data outperforms manual transcription of a randomly chosen set.
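The directed-selection idea, routing the worst-recognised material to human transcribers, might be sketched as below. The use of an average-confidence score as the ranking key and the transcription budget are illustrative assumptions, not the paper's exact criterion.

```python
def select_for_manual_transcription(utterances, budget=0.18):
    """Pick the fraction of utterances with the lowest ASR confidence.

    utterances: list of (utt_id, confidence) pairs, where confidence is
    e.g. an average word posterior from the automatic transcription.
    Returns the ids to send to human transcribers; the remainder keep
    their automatic transcriptions for unsupervised training.
    """
    ranked = sorted(utterances, key=lambda u: u[1])  # least confident first
    k = max(1, int(len(ranked) * budget))
    return [utt_id for utt_id, _ in ranked[:k]]
```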
ThB.P2a‑3

Context Dependent Syllable Acoustic Model for Continuous Chinese Speech Recognition
Hao Wu, Speech and Hearing Research Center, National Laboratory on Machine Perception, Peking University, P.R.China
Xihong Wu, Speech and Hearing Research Center, National Laboratory on Machine Perception, Peking University, P.R.China

The choice of basic modeling unit is an important issue in building an acoustic model for continuous speech recognition. In this paper, syllable-based approaches to Chinese acoustic modeling are presented. Compared with the dominant initial/final modeling units, syllables implicitly model intra-syllable variations with good accuracy. Moreover, by carefully choosing context modeling schemes and parameter tying methods, a syllable-based acoustic model can capture longer temporal variations while keeping model complexity well controlled. To address the data-imbalance problem, approaches based on multiple-sized unit models are also implemented in this research. Experimental results show that the presented syllable-based acoustic model is effective in improving the performance of continuous Chinese speech recognition.
ThB.P2a‑4

A Sub-optimal Viterbi-like Search for Linear Dynamic Models Classification
Dimitris Oikonomidis, Technical University of Crete
Vassilis Diakoloukas, Technical University of Crete
Vassilis Digalakis, Technical University of Crete

This paper describes a Viterbi-like decoding algorithm applied to segment models based on linear dynamic models (LDMs). LDMs are a promising acoustic modeling scheme that can alleviate several limitations of the popular hidden Markov models (HMMs). Several implementations of LDMs can be found in the literature; for our decoding experiments we consider general identifiable forms of LDMs, which allow increased state-space dimensionality and relax most of the constraints found in other approaches. Results on the AURORA2 database show that our decoding scheme significantly outperforms standard HMMs, particularly at high noise levels.
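An LDM scores a speech segment through the Kalman-filter innovation decomposition of its likelihood, which is the quantity a Viterbi-like search over segmentations would accumulate. A minimal scoring sketch under generic model matrices (the paper's identifiable parameterization and search itself are not reproduced):

```python
import numpy as np

def ldm_log_likelihood(Y, A, C, Q, R, mu0, P0):
    """Log-likelihood of observation sequence Y under the linear dynamic model
        x_t = A x_{t-1} + w_t,   y_t = C x_t + v_t,
    with w ~ N(0, Q), v ~ N(0, R), initial state N(mu0, P0),
    computed via the Kalman filter innovation decomposition."""
    mu, P = mu0, P0
    ll = 0.0
    for y in Y:
        # Predict the state one step ahead.
        mu = A @ mu
        P = A @ P @ A.T + Q
        # Innovation (prediction error) and its covariance.
        e = y - C @ mu
        S = C @ P @ C.T + R
        ll += -0.5 * (len(y) * np.log(2 * np.pi)
                      + np.log(np.linalg.det(S))
                      + e @ np.linalg.solve(S, e))
        # Measurement update.
        K = P @ C.T @ np.linalg.inv(S)
        mu = mu + K @ e
        P = P - K @ C @ P
    return ll
```

Classification then amounts to evaluating each candidate model's log-likelihood on the segment and picking the maximum.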
ThB.P2a‑5

On the Equivalence of Gaussian HMM and Gaussian HMM-like Hidden Conditional Random Fields
Georg Heigold, RWTH Aachen University, Lehrstuhl fuer Informatik 6 - Computer Science Department, D-52056 Aachen, Germany
Ralf Schlueter, RWTH Aachen University, Lehrstuhl fuer Informatik 6 - Computer Science Department, D-52056 Aachen, Germany
Hermann Ney, RWTH Aachen University, Lehrstuhl fuer Informatik 6 - Computer Science Department, D-52056 Aachen, Germany

In this work we show that Gaussian HMMs (GHMMs) are equivalent to GHMM-like hidden conditional random fields (HCRFs). Hence, improvements of HCRFs over GHMMs reported in the literature are not due to refined acoustic modeling but rather come from the more robust formulation of the underlying optimization problem. Conventional GHMMs are usually estimated with a criterion at the segment level, whereas hybrid approaches formulate the criterion at the frame level; in contrast to CRFs, however, such approaches do not provide scores or do not support more than two classes in a natural way. We analyze these two classes of criteria and propose a refined frame-based criterion, which is shown to be an approximation of the associated segment-level criterion. Experimental results on these issues are reported for the German digit string recognition task SieTill and the large-vocabulary English European Parliament Plenary Sessions (EPPS) task.
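The core of such an equivalence argument is that a Gaussian log-density is linear in quadratic features of the observation, so every GHMM emission can be rewritten as a log-linear HCRF feature function. A sketch of this step (notation chosen here, not the authors'):

```latex
\log \mathcal{N}(x;\mu,\Sigma)
  = -\tfrac{1}{2}\, x^{\top}\Sigma^{-1}x \;+\; \mu^{\top}\Sigma^{-1}x
    \;-\; \tfrac{1}{2}\,\mu^{\top}\Sigma^{-1}\mu
    \;-\; \tfrac{1}{2}\log\det(2\pi\Sigma)
  \;=\; \lambda^{\top} f(x) + \mathrm{const},
\qquad f(x) = \bigl(x,\; \operatorname{vec}(xx^{\top})\bigr),
```

where \(\lambda\) collects the Gaussian parameters. Running this rewriting over all states (and absorbing transition weights into further log-linear terms) maps a GHMM onto a GHMM-like HCRF with identical posteriors.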
ThB.P2a‑6

Speeding-up Neural Network Training Using Sentence and Frame Selection
Stefano Scanzio, Politecnico di Torino - Italy
Pietro Laface, Politecnico di Torino - Italy
Roberto Gemello, Loquendo - Italy
Franco Mana, Loquendo - Italy

Training Artificial Neural Networks (ANNs) with large amounts of speech data is a time-intensive task due to the intrinsically sequential nature of the back-propagation algorithm. This paper presents an approach to training ANNs using sentence and frame selection. The goal is to speed up the training process and to balance the phonetic coverage of the selected frames, mitigating the classification problems related to the prior probabilities of the individual phonetic classes. These techniques, together with a three-step training approach and software optimizations, reduced the training time of our models by an order of magnitude.
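The frame-selection idea, subsampling over-represented phonetic classes so that class priors are better balanced before back-propagation, might be sketched as follows; the fixed per-class cap is an illustrative assumption, not the paper's exact selection rule.

```python
import random
from collections import defaultdict

def balance_frames(frames, cap_per_class, seed=0):
    """frames: list of (feature, phone_label) pairs.

    Keeps at most cap_per_class frames per phonetic class, so that
    frequent classes (e.g. silence) no longer dominate the gradient
    updates, and shuffles the result to avoid class-ordered training.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for frame in frames:
        by_class[frame[1]].append(frame)
    selected = []
    for label, items in by_class.items():
        if len(items) > cap_per_class:
            items = rng.sample(items, cap_per_class)
        selected.extend(items)
    rng.shuffle(selected)
    return selected
```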
ThB.P2a‑7

Using a Small Development Set to Build a Robust Dialectal Chinese Speech Recognizer
Linquan Liu, Center for Speech Technology, Tsinghua Univ., China
Thomas Fang Zheng, Center for Speech Technology, Tsinghua Univ., China
Makoto Akabane, Sony Computer Entertainment Inc., Japan
Ruxin Chen, Sony Computer Entertainment America, USA
Wenhu Wu, Center for Speech Technology, Tsinghua Univ., China

To make full use of a small development data set in building a robust dialectal Chinese speech recognizer from a standard Chinese recognizer (based on Chinese initials/finals, IFs), a novel, simple but effective acoustic modeling method named state-dependent phoneme-based model merging (SDPBMM) is proposed and evaluated, in which a shared state of a standard tri-IF is merged with a state of a dialectal mono-IF in terms of pronunciation variation modeling. Furthermore, to deal with phonetic-level pronunciation variations in SDPBMM, distance-based pronunciation modeling is proposed, based on a small dialectal Chinese data set. With a 40-minute Shanghai-dialect Chinese data set, SDPBMM achieves a significant syllable error rate (SER) reduction of 14.3% for dialectal Chinese with almost no performance degradation on standard Chinese.
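One plausible form of the state-merging step is to pool the Gaussians of the standard shared state with those of the dialectal state into a single reweighted mixture; the sketch below assumes single-stream diagonal Gaussians and an interpolation weight `w`, both illustrative assumptions rather than the paper's exact SDPBMM formulation.

```python
def merge_states(standard_mix, dialect_mix, w=0.5):
    """Pool the Gaussians of a standard tri-IF shared state and a
    dialectal mono-IF state into one mixture.

    Each mixture is a list of (weight, mean, var) triples whose weights
    sum to 1. The standard side contributes total mass w and the
    dialectal side 1 - w, so the merged weights again sum to 1.
    """
    merged = [(w * wt, mean, var) for wt, mean, var in standard_mix]
    merged += [((1.0 - w) * wt, mean, var) for wt, mean, var in dialect_mix]
    return merged
```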
