Interspeech 2007 Session WeB.SS: Structure-based and template-based automatic speech recognition
Type: Special session
Date: Wednesday, August 29, 2007
Time: 10:00 – 12:00
Room: Astrid Scala 1
Chairs: Li Deng (Microsoft Research), Helmer Strik (Radboud University Nijmegen)
WeB.SS‑1
Temporal Episodic Memory Model: An Evolution of MINERVA2
Viktoria Maier, Department of Speech and Hearing, University of Sheffield, Sheffield, United Kingdom
Roger K. Moore, Department of Speech and Hearing, University of Sheffield, Sheffield, United Kingdom
This paper introduces a new model for automatic speech recognition (ASR) called TEMM, the Temporal Episodic Memory Model. TEMM is derived from MINERVA2, a simulation of human episodic memory; it overcomes MINERVA2's inability to use temporal sequence information flexibly for recognition, and additionally employs a prediction mechanism as a further source of information. The performance of TEMM on an ASR task is compared to state-of-the-art HMM/GMM baseline systems; a first analysis shows competitive results, but also a need to improve the consistency of the new model's output.
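For readers unfamiliar with the underlying memory model, the retrieval step of the original MINERVA2 (Hintzman's multiple-trace model, which TEMM extends with temporal sequencing and prediction) can be sketched as follows. This is a minimal illustration of the published MINERVA2 equations, not of TEMM itself; the function name and the {-1, 0, +1} feature coding are assumptions made here for illustration.

```python
import numpy as np

def minerva2_echo(probe, traces):
    """Retrieval in MINERVA2: every stored trace is activated in
    proportion to the cube of its similarity to the probe, and the
    'echo' returned is the activation-weighted sum of all traces.
    Features are coded as -1, 0 (unknown) or +1."""
    # Similarity: dot product normalised by the number of features
    # that are non-zero in either the probe or the trace.
    relevant = (traces != 0) | (probe != 0)        # (n_traces, n_feat)
    n_relevant = np.maximum(relevant.sum(axis=1), 1)
    similarity = (traces @ probe) / n_relevant
    activation = similarity ** 3                   # cubing sharpens recall
    intensity = activation.sum()                   # echo intensity
    content = activation @ traces                  # echo content
    return intensity, content

# Example: probe a memory of three stored feature vectors.
traces = np.array([[1, -1, 1, 0], [1, 1, -1, -1], [-1, 0, 1, 1]])
print(minerva2_echo(np.array([1, -1, 1, 1]), traces))
```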
WeB.SS‑2
Speech Recognition with Factorial-HMM Syllabic Acoustic Models
Gianpaolo Coro, Department of Physics, University of Naples “Federico II”, Naples, Italy, and ABLA srl, Milan, Italy
Francesco Cutugno, Department of Physics, University of Naples “Federico II”, Naples, Italy
Fulvio Caropreso, Department of Physics, University of Naples “Federico II”, Naples, Italy
Classic approaches to automatic speech recognition cannot capture all of the information in a speech signal; furthermore, real-time constraints on decoding prevent the system from achieving an optimal alignment between acoustic models and signal. In this paper, we present an approach to speech recognition in which Factorial Hidden Markov Models (FHMMs) are used as syllabic acoustic models, with an alignment algorithm used for unit decoding. As application domain we chose numbers uttered in Italian. Syllabic accuracy in our model is 84.81%; correctness on numbers is 77.74%. The aim of the experiment is to show that the strength of FHMMs lies in their ability to capture two different temporal dynamics in a speech segment: one with quasi-segmental timing, the other with a quasi-syllabic trend. Moreover, we evaluate a unit-decoding process based on a dynamic programming algorithm in order to exploit the acoustic models to best advantage.
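As a sketch of the modelling idea: a factorial HMM with two independent hidden chains, one per temporal dynamic, is equivalent for exact decoding to a single HMM over the Cartesian product of the two state spaces. The transition matrices below are hypothetical placeholders, not the paper's models.

```python
import numpy as np

# Two independent hidden chains, one slow (quasi-segmental) and one
# faster (quasi-syllabic); values are illustrative only.
A_segmental = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
A_syllabic = np.array([[0.6, 0.4],
                       [0.5, 0.5]])

# Exact decoding can treat the FHMM as one HMM over (s1, s2) pairs:
# with independent chains, the joint transition matrix is the
# Kronecker product of the per-chain matrices.
A_joint = np.kron(A_segmental, A_syllabic)     # shape (4, 4)
assert np.allclose(A_joint.sum(axis=1), 1.0)   # still row-stochastic
```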
WeB.SS‑3
Evaluating Acoustic Distance Measures for Template Based Recognition
Mathias De Wachter, K.U.Leuven - Dept. ESAT
Kris Demuynck, K.U.Leuven - Dept. ESAT
Patrick Wambacq, K.U.Leuven - Dept. ESAT
Dirk Van Compernolle, K.U.Leuven - Dept. ESAT
In this paper we investigate the behaviour of different acoustic distance measures for template-based speech recognition in light of the combination of acoustic distances, linguistic knowledge and template concatenation fluency costs. To that end, different acoustic distance measures are compared on tasks with varying levels of fluency/linguistic constraints. We show that adopting those constraints invariably results in a clearly suboptimal acoustic template sequence being chosen as the winning hypothesis. This has strong implications for the design of acoustic distance measures: distance measures that are optimal for frame-based classification may prove suboptimal for full-sentence recognition. In particular, we show that this is the case when comparing the Euclidean and the recently introduced adaptive-kernel local Mahalanobis distance measures.
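A minimal sketch of the two frame-level distances being contrasted, assuming the local covariance for the Mahalanobis variant has already been estimated (e.g., from an adaptive kernel around the reference frame); the function names are illustrative, not the paper's.

```python
import numpy as np

def euclidean(x, y):
    """Plain Euclidean distance between two acoustic frames."""
    return float(np.linalg.norm(x - y))

def local_mahalanobis(x, y, cov_local):
    """Mahalanobis distance using a locally estimated covariance,
    standing in for the adaptive-kernel estimate discussed in the
    paper; `cov_local` must be positive definite."""
    d = x - y
    return float(np.sqrt(d @ np.linalg.solve(cov_local, d)))
```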
WeB.SS‑4
Hierarchical Acoustic Modeling Based on Random-Effects Regression for Automatic Speech Recognition
Yan Han, Department of Language and Speech, Radboud University Nijmegen, The Netherlands
Lou Boves, Department of Language and Speech, Radboud University Nijmegen, The Netherlands
Recent research on human intelligence suggests that the auditory system has a hierarchical structure, in which the lower levels store the individual properties of utterances and the upper levels store their group properties. However, most conventional automatic recognizers adopt a single-level model structure. In structure-based models, such as HMMs and parametric trajectory models, only the group properties of utterances are modeled; in template-based models, only the individual properties are exploited. In this paper, we propose a novel hierarchical acoustic model that simulates the human auditory hierarchy, in which both the group and the individual properties can be explicitly addressed. Furthermore, we developed two evaluation methods, namely top-down and bottom-up tests, to simulate the prediction-verification loops in human hearing. The proposed hierarchical model significantly outperforms parametric trajectory models on a TIMIT vowel classification task.
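The two-level idea can be written, in hedged and generic form, as a random-effects regression over trajectories: for utterance i of class c, the observed feature trajectory is a fixed (group) trend plus a per-utterance (individual) random deviation. The exact design matrix and priors are the paper's; the notation below is only a sketch.

```latex
\[
  \mathbf{y}_i(t) = \mathbf{X}(t)\,\bigl(\boldsymbol{\beta}_c + \mathbf{b}_i\bigr) + \boldsymbol{\epsilon}_i(t),
  \qquad
  \mathbf{b}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{D}_c),
  \quad
  \boldsymbol{\epsilon}_i(t) \sim \mathcal{N}(\mathbf{0}, \sigma_c^2 \mathbf{I}),
\]
```

where the fixed effect β_c carries the group properties of class c and the random effect b_i the individual properties of utterance i.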
WeB.SS‑5
Construction and Analysis of Multiple Paths in Syllable Models
Annika Hämäläinen, Radboud University Nijmegen
Louis ten Bosch, Radboud University Nijmegen
Lou Boves, Radboud University Nijmegen
In this paper, we construct multi-path syllable models using phonetic knowledge to initialise the parallel paths and a data-driven solution for their re-estimation. We hypothesise that the richer topology of multi-path syllable models is better at accounting for pronunciation variation than context-dependent phone models, which can only account for the effects of the left and right neighbours. We show that parallel paths that are initialised with phonetic knowledge and then re-estimated do indeed result in different trajectories in feature space; yet this does not result in better recognition performance. We suggest explanations for this finding, and provide the reader with insights into the issues that play a role in pronunciation variation modelling with multi-path syllable models.
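For concreteness, a hedged sketch of the kind of topology meant here: K parallel left-to-right paths (each initialised from a different pronunciation variant) sharing a non-emitting entry and exit. All parameter values and the helper name are hypothetical.

```python
import numpy as np

def multipath_syllable_topology(n_paths=3, states_per_path=4, p_stay=0.6):
    """Build the transition matrix of a multi-path syllable HMM:
    a non-emitting entry state (index 0) fans out to `n_paths`
    parallel left-to-right chains, which rejoin at a non-emitting
    exit state (last index)."""
    n = 2 + n_paths * states_per_path
    A = np.zeros((n, n))
    for p in range(n_paths):
        first = 1 + p * states_per_path
        A[0, first] = 1.0 / n_paths            # uniform prior over paths
        for s in range(states_per_path):
            i = first + s
            A[i, i] = p_stay                   # self-loop (duration)
            nxt = i + 1 if s < states_per_path - 1 else n - 1
            A[i, nxt] = 1.0 - p_stay           # advance, or leave via exit
    return A

A = multipath_syllable_topology()
assert np.allclose(A[:-1].sum(axis=1), 1.0)    # all non-exit rows stochastic
```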
WeB.SS‑6
Landmark-based Approach to Speech Recognition: An Alternative to HMMs
Carol Espy-Wilson, Institute for Systems Research and Dept. of Electrical and Computer Eng., University of Maryland, College Park, MD 20742, USA
Tarun Pruthi, Institute for Systems Research and Dept. of Electrical and Computer Eng., University of Maryland, College Park, MD 20742, USA
Amit Juneja, Think-A-Move, Ltd., Beachwood, OH 44122, USA
Om Deshmukh, Institute for Systems Research and Dept. of Electrical and Computer Eng., University of Maryland, College Park, MD 20742, USA
In this paper, we compare a probabilistic Landmark-Based speech recognition System (LBS), which uses Knowledge-based Acoustic Parameters (APs) as its front end, with an HMM-based recognition system that uses Mel-Frequency Cepstral Coefficients as its front end. The advantages of LBS based on APs are: (1) the APs are normalized for extra-linguistic information; (2) acoustic analysis at different landmarks may be performed with different resolutions and with different APs; (3) LBS outputs multiple acoustic landmark sequences that signal perceptually significant regions in the speech signal; (4) it may be easier to port the system to another language, since the phonetic features captured by the APs are universal; and (5) LBS can be used as a tool for uncovering, and subsequently understanding, variability. LBS also has a probabilistic framework that can be combined with pronunciation and language models in order to make it more scalable to large-vocabulary recognition tasks.
WeB.SS‑7
Automatic Recognition of Connected Vowels Only Using Speaker-invariant Representation of Speech Dynamics
Satoshi Asakawa, The University of Tokyo
Nobuaki Minematsu, The University of Tokyo
Keikichi Hirose, The University of Tokyo
Speech acoustics vary with differences in gender, age, microphone, room, transmission line, and so on. In speech recognition research, speech from thousands of speakers recorded under different acoustic conditions has been used to train acoustic models of individual phonemes in order to deal with these inevitable non-linguistic variations. Recently, a novel representation of speech dynamics was proposed in which the above non-linguistic factors are effectively removed from speech. This representation captures only speaker-invariant speech dynamics; no absolute acoustic properties, such as spectra, are used. In our previous study, the new representation was applied to recognizing a sequence of isolated vowels. The proposed method, trained on a single speaker, outperformed conventional HMMs trained on more than four thousand speakers, even for noisy speech. The current paper shows the initial results of applying the dynamic representation to recognizing continuous speech, that is, connected vowels.
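The invariance argument can be sketched as follows, under the assumption (made here only for illustration) that each speech event is modelled as a Gaussian: an f-divergence such as the Bhattacharyya distance is unchanged by any invertible transform applied jointly to the feature space, so the matrix of pairwise distances between events, the "structure", survives speaker- and channel-dependent distortions. Function names below are hypothetical.

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians; invariant to any
    invertible affine map applied jointly to both distributions."""
    cov = 0.5 * (cov1 + cov2)
    d = mu1 - mu2
    term_mean = 0.125 * d @ np.linalg.solve(cov, d)
    term_cov = 0.5 * np.log(np.linalg.det(cov) /
                            np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term_mean + term_cov

def structure_matrix(events):
    """`events`: list of (mean, covariance) pairs, one per speech event
    (e.g. per vowel). The pairwise-distance matrix is the
    speaker-invariant representation used for matching."""
    n = len(events)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = bhattacharyya(*events[i], *events[j])
    return D
```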
WeB.SS‑8
A Structured Speech Model Parameterized by Recursive Dynamics and Neural Networks
Roberto Togneri, The University of Western Australia
Li Deng, Microsoft Research
We present in this paper an overview of the Hidden Dynamic Model (HDM) paradigm, exemplifying the parametric construction of structure-based speech models that can be used for recognition purposes. We explore a general class of HDM that uses recursive autoregressive functions to represent the hidden speech dynamics, and neural networks to represent the functional relationship between the hidden and observed speech vectors. This state-space formulation of the HDM is reviewed in terms of model construction, a parameter estimation technique, and a decoding method. We also present typical experimental results on the use of this type of HDM for phonetic recognition and for automatic vocal tract resonance tracking. We further analyse the computational complexity (for decoding) and the parameter size of the HDM in comparison with the HMM. Finally, we discuss several key issues related to future exploration of the HDM paradigm.
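In the form typically used in the HDM literature, the state-space model reads (the notation is a generic sketch and may differ in detail from the paper's):

```latex
\[
  \mathbf{x}_t = \mathbf{A}_s\,\mathbf{x}_{t-1} + (\mathbf{I} - \mathbf{A}_s)\,\mathbf{u}_s + \mathbf{w}_t,
  \qquad
  \mathbf{o}_t = g\bigl(\mathbf{x}_t\bigr) + \mathbf{v}_t,
\]
```

where x_t is the hidden dynamic vector (e.g. vocal tract resonances), u_s the phonetic target of the current segment s, A_s a diagonal matrix of time constants, g(·) the neural network mapping hidden to observed vectors, and w_t, v_t Gaussian noise terms. The recursion makes x_t relax exponentially towards its target, which is what encodes coarticulation across segments.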
WeB.SS‑9
Structure-Based and Template-Based Automatic Speech Recognition - Comparing parametric and non-parametric approaches
Li Deng, Microsoft Research, One Microsoft Way, Redmond, WA, USA
Helmer Strik, CLST, Department of Linguistics, Radboud University, Nijmegen, the Netherlands
This paper provides an introductory tutorial for the Interspeech 2007 special session on "Structure-Based and Template-Based Automatic Speech Recognition". The purpose of the special session is to bring together researchers with a special interest in novel techniques aimed at overcoming the weaknesses of HMMs for acoustic modeling in speech recognition. Numerous such approaches have been taken over the past dozen years; they can be broadly classified into structure-based (parametric) and template-based (non-parametric) ones. In this paper, we provide an overview of both approaches, focusing on the incorporation of long-range temporal dependencies of the speech features, and of phonetic detail, in speech recognition algorithms. We give a high-level survey of major existing work and systems using these two types of "beyond-HMM" frameworks. The contributed papers in this special session elaborate further on the related topics.
WeB.SS‑10
Learning the Inter-frame Distance for Discriminative Template-based Keyword Detection
David Grangier, IDIAP Research Institute
Samy Bengio, Google Inc.
This paper proposes a discriminative approach to template-based keyword detection. We introduce a method to learn the distance used to compare acoustic frames, a crucial element of template-matching approaches. The proposed algorithm estimates the distance from data, with the objective of producing a detector that maximizes the area under the receiver operating characteristic curve (AUC), the standard evaluation measure for the keyword detection problem. Experiments performed over a large corpus, SpeechDatII, suggest that our model is effective compared to an HMM system: the proposed approach reaches an averaged AUC of 93.8%, compared to 87.9% for the HMM.
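A hedged sketch of the training idea: parameterize the frame distance (here, purely for illustration, as a diagonally weighted squared distance) and update it with a pairwise hinge loss whose expectation over positive/negative utterance pairs is a standard convex surrogate for 1 − AUC. The parameterization and function names are assumptions, not the paper's exact model.

```python
import numpy as np

def weighted_sqdist(x, y, w):
    """Diagonally weighted squared distance between two frames;
    the non-negative weights w are the learned parameters."""
    return float(np.sum(w * (x - y) ** 2))

def auc_hinge_update(w, s_pos, grad_pos, s_neg, grad_neg, lr=0.01):
    """One SGD step on max(0, 1 - (s_pos - s_neg)), where s_pos is the
    detector score on an utterance containing the keyword and s_neg on
    one that does not (higher score = keyword more likely present;
    with template matching, a score could be minus the DTW cost).
    grad_pos / grad_neg are the gradients of the scores w.r.t. w."""
    if 1.0 - (s_pos - s_neg) > 0.0:          # pair mis-ranked or margin too small
        w = w - lr * (grad_neg - grad_pos)   # raise s_pos, lower s_neg
    return np.maximum(w, 0.0)                # keep the distance valid
```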
WeB.SS‑11
Handling Phonetic Context and Speaker Variation in a Structure-Based Speech Recognizer
Dong Yu, Microsoft Research
Li Deng, Microsoft Research
Alex Acero, Microsoft Research
Recently we have developed a novel type of structure-based speech recognizer that uses a parameterized, non-recursive "hidden" trajectory model (HTM) of vocal tract resonances (VTRs) to capture the dynamic structure of long-range speech coarticulation and reduction. In this paper, we elaborate on two key aspects of the model. First, the phonetic context controls the direction of movement, and thus the formation, of the VTR trajectories. This provides "structured" context dependency for speech acoustics without the context-dependent parameters required by HMMs. Second, the VTR targets, the key context-independent parameters of the model, vary across speakers. We describe an effective target-value normalization algorithm that can be applied to both training and test speakers. We report experimental results demonstrating the effectiveness of the normalization algorithm in the context of structure-based speech recognition, and provide a computational analysis of the HTM-based speech decoder.
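As a sketch of the "non-recursive" construction (the notation follows the hidden-trajectory-model literature in general terms and may differ in detail from the paper): each frame's VTR value is a bidirectional, finite-length weighted sum of the segmental targets around it,

```latex
\[
  \mathbf{z}(k) \;=\; \sum_{\tau = k - D}^{\,k + D} c_{\gamma(\tau)}(k - \tau)\,\mathbf{t}(\tau),
\]
```

so that each target t(τ), a context-independent per-phone parameter, is smeared over ±D frames by the filter c, producing coarticulation from both left and right context without any context-dependent parameters. Per-speaker normalization then amounts, roughly, to mapping a new speaker's target values towards those of the training population.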