Interspeech 2007 Session WeD.P3: ASR: new paradigms
Wednesday, August 29, 2007
16:00 – 18:00
Jeff Bilmes (University of Washington, Seattle)
Modeling Context and Language Variation for Non-Native Speech Recognition
Tien Ping Tan, Laboratoire d’Informatique de Grenoble
Laurent Besacier, Laboratoire d’Informatique de Grenoble
Non-native speakers often have difficulty pronouncing words the way native speakers do. This paper proposes to model pronunciation variation in non-native speakers' speech using only acoustic models, without the need for a non-native speech corpus. Variation in terms of context and language is modeled. Combining both kinds of modeling reduced the absolute WER by as much as 16% and 6% for native Vietnamese and Chinese speakers of French, respectively.
An Evaluation of Cross-Language Adaptation and Native Speech Training for Rapid HMM Construction Based on Very Limited Training Data
Xufang Zhao, Institut national de la recherche scientifique, Universite du Quebec
Douglas O'Shaughnessy, Institut national de la recherche scientifique, Universite du Quebec
As the needs and opportunities for speech technology applications in a variety of languages have grown, methods for rapid transfer of speech technology across languages have become a practical concern. Previous work has focused on comparing different adaptation algorithms such as MAP and MLLR. However, an interesting point is that, as the adaptation corpus grows, direct native speech training may come to outperform cross-language adaptation; if so, there should be a threshold adaptation corpus size. Usually, transferring acoustic knowledge is useful when not enough training data is available. This paper presents a systematic comparison of the relative effectiveness of cross-language adaptation and native speech training, using transfer from English to Mandarin as a test case. This study found that cross-language adaptation does not produce better acoustic models than direct native speech training, even with limited training data.
Never-Ending Learning with Dynamic Hidden Markov Network
Konstantin Markov, Spoken Language Communication Group, NICT-ATR, Japan
Satoshi Nakamura, Spoken Language Communication Group, NICT-ATR, Japan
In this paper, we present a new speech model, a network of hidden Markov states capable of unsupervised on-line adaptive learning while preserving previously acquired knowledge. Speech patterns are represented by state sequences through the network. The network can detect unseen patterns, and when such a pattern is encountered, it is learned by adding new states and transitions to the network. States that are rarely visited are gradually removed. Thus, the network can grow and shrink as needed, i.e., it dynamically changes its structure. The learning process continues indefinitely, so it is called never-ending learning. The output of the network is the best state sequence, and decoding is done concurrently with learning. Initial experiments with a small database of isolated spelled letters showed that the proposed model is indeed capable of never-ending learning and can perfectly recognize previously learned speech patterns.
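The grow-and-shrink behavior described in the abstract can be caricatured in a few lines. The sketch below is a hypothetical simplification: it stores whole symbol patterns instead of HMM state sequences, uses a crude overlap similarity in place of acoustic likelihood, and invents all thresholds; only the "add unseen patterns, prune rarely visited ones" dynamic mirrors the paper.

```python
# Toy sketch of a grow-and-shrink pattern network (a hypothetical
# simplification; the real model is a network of hidden Markov states).
class DynamicPatternNetwork:
    def __init__(self, match_threshold=0.8, prune_after=3):
        self.patterns = {}              # pattern -> visit count
        self.match_threshold = match_threshold
        self.prune_after = prune_after  # sweeps a pattern may stay unvisited
        self.unvisited = {}             # pattern -> sweeps since last visit

    def _similarity(self, a, b):
        # crude symbol-overlap similarity; stands in for HMM likelihood
        if not a or not b:
            return 0.0
        matches = sum(1 for x, y in zip(a, b) if x == y)
        return matches / max(len(a), len(b))

    def observe(self, pattern):
        """Decode one input: return the best-matching stored pattern,
        or learn the input as a new pattern if nothing matches."""
        best, best_sim = None, 0.0
        for p in self.patterns:
            sim = self._similarity(pattern, p)
            if sim > best_sim:
                best, best_sim = p, sim
        if best is not None and best_sim >= self.match_threshold:
            self.patterns[best] += 1
            self.unvisited[best] = 0
            return best
        # unseen pattern: grow the network
        self.patterns[pattern] = 1
        self.unvisited[pattern] = 0
        return pattern

    def sweep(self):
        """Periodic maintenance: shrink by removing rarely visited patterns."""
        for p in list(self.patterns):
            self.unvisited[p] += 1
            if self.unvisited[p] >= self.prune_after:
                del self.patterns[p], self.unvisited[p]

net = DynamicPatternNetwork()
net.observe("hello")
net.observe("hallo")   # close enough: reinforces the stored "hello"
net.observe("world")   # unseen: the network grows
```

Learning never stops: each call to `observe` both decodes and updates the store, and `sweep` forgets what is no longer used.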
Building Multiple Complementary Systems using Directed Decision Trees
Catherine Breslin, Cambridge University Engineering Department
Mark Gales, Cambridge University Engineering Department
Large vocabulary speech recognition systems typically use a combination of multiple systems to obtain the final hypothesis. For combination to give gains, the systems being combined must be complementary, i.e. they must make different errors. Often, complementary systems are chosen simply by training multiple systems, performing all combinations, and selecting the best. This approach becomes time consuming as more potential systems are considered, and hence recent work has looked at explicitly building systems to be complementary to each other. This paper considers building multiple complementary systems based on directed decision trees, and combining them within a multi-pass adaptive framework. The tree divergence is introduced for easy comparison of trees without having to build entire systems. Experiments are presented on a Broadcast News Arabic task, and show that gains can be achieved by using more than one complementary system.
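The abstract's "tree divergence" is not defined here, so the sketch below does NOT reproduce it; it only illustrates the general idea of comparing two state-clustering trees directly, without building full systems, via an invented proxy: the fraction of context pairs that one tree ties to the same leaf while the other splits them.

```python
from itertools import combinations

# Hypothetical proxy for comparing decision trees (NOT the paper's measure):
# each tree maps a phonetic context to a leaf (cluster) id.
tree_a = {"a+b": 1, "a+c": 1, "x+y": 2, "x+z": 2}
tree_b = {"a+b": 1, "a+c": 2, "x+y": 3, "x+z": 3}

def clustering_disagreement(ta, tb):
    """Fraction of context pairs on which the two trees disagree about
    whether the pair shares a leaf."""
    contexts = sorted(set(ta) & set(tb))
    pairs = list(combinations(contexts, 2))
    disagree = sum((ta[u] == ta[v]) != (tb[u] == tb[v]) for u, v in pairs)
    return disagree / len(pairs)

div = clustering_disagreement(tree_a, tree_b)
```

Two trees with a large disagreement cluster contexts differently and are therefore more likely to make different errors, which is the property combination exploits.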
Automatic Speech Recognition Framework for Multilingual Audio Contents
Hiroaki Nanjo, Faculty of Science and Technology, Ryukoku University
Yuichi Oku, Faculty of Science and Technology, Ryukoku University
Takehiko Yoshimi, Faculty of Science and Technology, Ryukoku University
Automatic speech recognition (ASR) for multilingual audio contents, such as international conference recordings and broadcast news, is addressed. For handling such contents efficiently, simultaneous ASR is promising. Conventionally, ASR has been performed independently, language by language, even though multilingual speech, which consists of utterances in several languages expressing the same meaning, is available. In this paper, we discuss a bilingual speech recognition framework based on statistical ASR and machine translation (MT) in which bilingual ASR is performed simultaneously and complementarily. Then, through Japanese speech recognition with corresponding English text and MT, we show that the framework works well.
Combined Acoustic and Pronunciation Modelling for Non-Native Speech Recognition
Ghazi Bouselmi, LORIA
Dominique Fohr, LORIA
Irina Illina, LORIA
In this paper, we present several adaptation methods for non-native speech recognition. We have tested pronunciation modelling, MLLR and MAP non-native pronunciation adaptation, and HMM model retraining on the HIWIRE foreign-accented English speech database. The "phonetic confusion" scheme we have developed consists of associating with each spoken phone several sequences of confused phones. In our experiments, we used different combinations of acoustic models representing the canonical and the foreign pronunciations: spoken and native models, and models adapted to the non-native accent with MAP and MLLR. The joint use of pronunciation modelling and acoustic adaptation led to further improvements in recognition accuracy. The best combination of the above-mentioned techniques resulted in a relative word error reduction ranging from 46% to 71%.
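A phonetic-confusion scheme of this kind can be pictured as expanding each canonical pronunciation into the variants a non-native speaker might produce. The sketch below is only an illustration: the confusion table and phone symbols are invented, and the real scheme is learned from data and applied inside the recognizer, not as an offline expansion.

```python
from itertools import product

# Invented confusion table: each canonical phone maps to the phone
# sequences a non-native speaker may substitute for it.
confusion = {"th": [["th"], ["s"], ["t"]],   # e.g. "th" often becomes "s" or "t"
             "r":  [["r"], ["l"]]}

def expand(pron):
    """Generate alternative pronunciations by substituting confused
    phone sequences for each canonical phone."""
    options = [confusion.get(p, [[p]]) for p in pron]
    # one variant per combination of choices, concatenated in order
    return [sum(choice, []) for choice in product(*options)]

variants = expand(["th", "r", "iy"])   # toy pronunciation of "three"
```

Each variant would then be scored with the appropriate acoustic models (native, adapted, or target-language), letting the decoder pick the pronunciation the speaker actually produced.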
Automatic Estimation of Scaling Factors among Probabilistic Models in Speech Recognition
Tadashi Emori, Media and Information Research Laboratories, NEC Corporation, Kawasaki, Japan
Yoshifumi Onishi, Media and Information Research Laboratories, NEC Corporation, Kawasaki, Japan
Koichi Shinoda, Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan
We propose an efficient new method for estimating scaling factors among probabilistic models in speech recognition. Most speech recognition systems consist of an acoustic model and a language model, and require scaling factors to balance the probabilities between them. The scaling factors are conventionally optimized in recognition tests. In our proposed method, the scaling factors are regarded as parameters of a log-linear model, and they are estimated using a gradient-ascent method based on the maximum a posteriori probability criterion. The posterior probability is computed using word lattices. We employ an iteration technique that repeats a word-lattice-generation/scaling-factor-estimation process, and the resulting scaling factor estimation is robust with respect to changes in initial values. In experiments, estimated scaling factors were nearly identical to optimal values obtained in a greedy grid search, and they changed little with variations in initial values.
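The log-linear view can be made concrete on a toy lattice. In the sketch below (not the paper's implementation; the three-hypothesis "lattice" and all scores are invented), the acoustic and language log-scores of each hypothesis are treated as features, and the two scaling factors are updated by gradient ascent on the log posterior of the reference hypothesis, whose gradient is the familiar "reference feature minus expected feature".

```python
import math

# hypothesis -> (acoustic log-score, language log-score); index 0 is the reference
lattice = [(-10.0, -2.0), (-9.0, -5.0), (-12.0, -1.0)]
ref = 0

def log_posterior(lams):
    """log P(ref | lattice) under scaling factors lams = [lam_am, lam_lm]."""
    scores = [lams[0] * am + lams[1] * lm for am, lm in lattice]
    z = math.log(sum(math.exp(s) for s in scores))
    return scores[ref] - z

def grad(lams):
    """d/d(lams) log P(ref): feature(ref) - expected feature under posterior."""
    scores = [lams[0] * am + lams[1] * lm for am, lm in lattice]
    z = sum(math.exp(s) for s in scores)
    post = [math.exp(s) / z for s in scores]
    g_am = lattice[ref][0] - sum(p * am for p, (am, lm) in zip(post, lattice))
    g_lm = lattice[ref][1] - sum(p * lm for p, (am, lm) in zip(post, lattice))
    return g_am, g_lm

lams = [1.0, 1.0]            # initial scaling factors
for _ in range(200):         # gradient ascent
    g = grad(lams)
    lams = [l + 0.05 * gi for l, gi in zip(lams, g)]
```

In the full method, lattice generation and this estimation step alternate, which is what makes the result insensitive to where `lams` starts.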
Memory Efficient Modeling of Polyphone Context with Weighted Finite-State Transducers
Emilian Stoimenov, Universität Karlsruhe
John McDonough, Saarland University
In earlier work, we derived a transducer HC that translates from sequences of Gaussian mixture models directly to phone sequences. The HC transducer was statically expanded, then determinized and minimized. In this work, we present a refinement of this construction whereby the initial HC transducer is incrementally expanded and immediately determinized. This technique avoids the need for a full expansion of the initial HC, and thereby reduces the random access memory required to produce the determinized HC by a factor of more than five. With the incremental algorithm, we were able to construct HC for a semi-continuous acoustic model with 16,000 distributions, which reduced the word error rate from 34.1% to 32.9% with respect to a fully continuous system with 4,000 distributions on the lecture meeting portion of the NIST RT05 data.
Extra Large Vocabulary Continuous Speech Recognition Algorithm based on Information Retrieval
Valeriy Pylypenko, Department of Speech and Synthesis International Research/Training Center for Information Technologies and Systems, Kyiv, Ukraine
This paper presents a new two-pass algorithm for Extra Large (more than 1M words) Vocabulary Continuous Speech recognition based on Information Retrieval (ELVIRCOS). The principle of this approach is to decompose the recognition process into two passes, where the first pass builds the word subset for the second-pass recognition by using an information retrieval procedure. Word graph composition for continuous speech is presented. With this approach, high performance for large vocabulary speech recognition can be obtained.
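One plausible reading of the two-pass idea can be sketched with an inverted index: the first pass produces a phone string, the index retrieves every word sharing phone n-grams with it, and only that small subset enters the second-pass vocabulary. Everything below (the four-word lexicon, the pronunciations, trigram indexing) is invented for illustration; the actual ELVIRCOS retrieval procedure may differ.

```python
from collections import defaultdict

# Toy lexicon: word -> pronunciation (space-separated phones)
lexicon = {
    "speech":      "s p iy ch",
    "recognition": "r eh k ax g n ih sh ax n",
    "retrieval":   "r ih t r iy v ax l",
    "banana":      "b ax n ae n ax",
}

def phone_ngrams(pron, n=3):
    phones = pron.split()
    return {tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)}

# Inverted index: phone trigram -> words whose pronunciation contains it
index = defaultdict(set)
for word, pron in lexicon.items():
    for ng in phone_ngrams(pron):
        index[ng].add(word)

def retrieve(first_pass_phones, n=3, min_hits=1):
    """Return the vocabulary subset for the second pass, given the phone
    string hypothesized by the first pass."""
    phones = first_pass_phones.split()
    hits = defaultdict(int)
    for i in range(len(phones) - n + 1):
        for word in index.get(tuple(phones[i:i + n]), ()):
            hits[word] += 1
    return {w for w, c in hits.items() if c >= min_hits}

subset = retrieve("s p iy ch r eh k ax g n ih sh ax n")
```

The second pass then decodes with `subset` instead of the full million-word vocabulary, which is what makes the overall search tractable.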
PocketSUMMIT: Small-Footprint Continuous Speech Recognition
I. Lee Hetherington, MIT CSAIL
We present PocketSUMMIT, a small-footprint version of our SUMMIT continuous speech recognition system. With portable devices becoming smaller and more powerful, speech is increasingly becoming an important input modality on these devices. PocketSUMMIT is implemented as a variable-rate continuous-density hidden Markov model with diphone context-dependent models. We explore various Gaussian parameter quantization schemes and find that 8:1 compression or more is achievable with little reduction in accuracy. We also show how the quantized parameters can be used for rapid table lookup. We explore first-pass language model pruning in a finite-state transducer (FST) framework, as well as FST and n-gram weight quantization and bit packing, to further reduce memory usage. PocketSUMMIT is currently able to run a moderate-vocabulary conversational speech recognition system in real time in a few MB on current PDAs and smart phones.
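The abstract explores several quantization schemes without specifying them here; the sketch below shows just one simple possibility, an 8-bit uniform scalar quantizer over Gaussian means, plus the table-lookup trick of precomputing a score per codeword. All values are toy data.

```python
import math

def build_quantizer(values, bits=8):
    """Uniform scalar quantizer: float <-> small integer code."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels or 1.0   # guard against a degenerate range
    def encode(v):
        return round((v - lo) / step)  # float -> 8-bit index
    def decode(idx):
        return lo + idx * step         # index -> reconstructed float
    return encode, decode

means = [0.13, -2.4, 5.7, 3.14, -0.99, 4.2]   # toy Gaussian means
enc, dec = build_quantizer(means)
codes = [enc(m) for m in means]               # stored as one byte each: 8:1 vs float64
recon = [dec(c) for c in codes]

# Table lookup: with at most 256 codewords, per-frame Gaussian scores can be
# precomputed once per frame instead of evaluated per Gaussian.
sigma, x = 1.0, 0.0                           # toy variance and feature value
table = {c: -0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - dec(c)) ** 2 / (2 * sigma ** 2) for c in set(codes)}
```

Because the codebook has at most 256 entries, scoring a mixture reduces to byte reads and table lookups, which matters on PDA-class hardware.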
Development of Preschool Children Subsystem for ASR and Q&A in a Real-Environment Speech-oriented Guidance Task
Tobias Cincarek, Nara Institute of Science and Technology, Japan
Izumi Shindo, Nara Institute of Science and Technology, Japan
Tomoki Toda, Nara Institute of Science and Technology, Japan
Hiroshi Saruwatari, Nara Institute of Science and Technology, Japan
Kiyohiro Shikano, Nara Institute of Science and Technology, Japan
The development of a module for speech recognition and answer generation for preschool children in a speech-oriented guidance system is described. This topic requires special treatment because recognition performance for preschool children is still disproportionately low compared to older children, there is growing business demand, and relatively little research has been carried out, especially on building practical applications. A real-environment speech database with more than 12,000 utterances from Japanese preschool children and more than 60,000 utterances from school children is employed for system development. The gap between preschool children's and standard pronunciation is narrowed by introducing uniform reference transcriptions and pronunciation modeling. Furthermore, the language and acoustic models are optimized.
A Study on Word Detector Design and Knowledge-based Pruning and Rescoring
Chengyuan Ma, Georgia Institute of Technology
Chin-Hui Lee, Georgia Institute of Technology
This paper presents a two-stage approach, a keyword-filler network method followed by knowledge-based pruning and rescoring, for detecting any given word in continuous speech. Unlike conventional keyword spotting systems, both content words and function words are considered in this study. To reduce the high miss rate, a modified grammar network for word detection is proposed. Then knowledge sources from landmark detection, attribute detection, and other spectral cues are combined to remove unlikely putative segments from the hypothesized word candidates. This study was evaluated on the WSJ0 corpus under matched and mismatched acoustic conditions. Compared with a conventional keyword spotting system, the proposed word detector greatly improves detection performance. The figures of merit for content and function words were improved from 48.8% to 61.5% and from 22.3% to 33.1%, respectively.
Parameter Tuning for Fast Speech Recognition
Thomas Colthurst, BBN Technologies
Tresi Arvizo, BBN Technologies
Chia-Lin Kao, BBN Technologies
Owen Kimball, BBN Technologies
Stephen Lowe, BBN Technologies
David Miller, BBN Technologies
Jim Sciver, BBN Technologies
We describe a novel method for tuning the decoding parameters of a speech-to-text system so as to minimize word error rate (WER) subject to an overall time constraint. When applied to three sub-realtime systems for recognizing English conversational telephone speech, the method gave speed improvements of up to 21.1% while at the same time reducing WER by up to 6.7%.
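The abstract does not detail the tuning method, so the sketch below only frames the optimization problem it solves: among candidate decoding-parameter settings with measured WER and runtime, pick the best WER whose runtime fits the budget. The parameter grid and measurements are invented, and the search here is exhaustive rather than whatever the actual method does.

```python
# (beam_width, max_active) -> (measured WER %, runtime as fraction of realtime)
# All numbers are invented for illustration.
measurements = {
    (100, 1000): (25.0, 0.30),
    (150, 2000): (23.1, 0.55),
    (200, 4000): (22.4, 0.90),
    (250, 8000): (22.2, 1.40),   # best WER, but misses the realtime budget
}

def tune(measurements, time_budget=1.0):
    """Return the parameter setting minimizing WER subject to runtime <= budget."""
    feasible = {p: (wer, t) for p, (wer, t) in measurements.items()
                if t <= time_budget}
    return min(feasible, key=lambda p: feasible[p][0])

best = tune(measurements)
```

The constrained formulation is the key point: without the budget, tuning would simply pick the slowest, most accurate setting.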
A computational model for unsupervised word discovery
Louis ten Bosch, Radboud University, Nijmegen
Bert Cranen, Radboud University, Nijmegen
We present an unsupervised algorithm for the discovery of words and word-like fragments from the speech signal, without using a predefined lexicon or acoustic phone models. The algorithm is based on a combination of acoustic pattern discovery, clustering, and temporal sequence learning. In its current form, the algorithm is able to discover words in low-perplexity speech (connected digits). Although its performance still falls short of mainstream ASR approaches, the value of the algorithm is its potential to serve as a computational model in two research directions. First, the algorithm may lead to an approach for speech recognition that is fundamentally liberated from the modelling constraints of conventional ASR. Second, the proposed algorithm can be interpreted as a computational model of language acquisition that takes actual speech as input and is able to find words as 'emergent' properties of the raw input.
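The spirit of lexicon-free word discovery can be shown on symbol strings standing in for acoustic pattern labels: fragments that recur across utterances emerge as word candidates, with no dictionary given up front. This is a deliberately crude sketch with invented data; the actual algorithm operates on real speech and adds clustering and temporal sequence learning.

```python
from collections import Counter

# Toy "utterances": concatenated words with no boundaries marked
utterances = ["onetwo", "twothree", "threeone", "onethree", "twoone"]

def discover(utterances, min_len=3, min_count=2):
    """Surface recurring fragments as word candidates."""
    counts = Counter()
    for u in utterances:
        for i in range(len(u)):
            for j in range(i + min_len, len(u) + 1):
                counts[u[i:j]] += 1
    frequent = {f: c for f, c in counts.items() if c >= min_count}
    # drop fragments contained in an equally frequent longer fragment
    words = {f for f in frequent
             if not any(f != g and f in g and frequent[g] >= frequent[f]
                        for g in frequent)}
    return words

words = discover(utterances)
```

The discovered set contains the three underlying "words" (plus some residue, which is exactly the kind of imperfection the abstract concedes relative to mainstream ASR).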
Phoneme Confusions in Human and Automatic Speech Recognition
Bernd T. Meyer, University of Oldenburg
Matthias Wächter, University of Oldenburg
Thomas Brand, University of Oldenburg
Birger Kollmeier, University of Oldenburg
A comparison between automatic speech recognition (ASR) and human speech recognition (HSR) is performed as a prerequisite for identifying sources of errors and improving feature extraction in ASR. HSR and ASR experiments are carried out with the same logatome database, which consists of nonsense syllables. Two different kinds of signals are presented to human listeners: first, noisy speech samples are converted to Mel-frequency cepstral coefficients and resynthesized to speech, with information about voicing and fundamental frequency discarded; second, the original signals with added noise are presented, which is used to evaluate the loss of information caused by the resynthesis process. The analysis also covers the degradation of ASR caused by dialect or accent and shows that different error patterns emerge for ASR and HSR. The information loss induced by the calculation of ASR features has the same effect as a deterioration of the SNR by 10 dB.
Construction of Spoken Language Model Including Fillers Using Filler Prediction Model
Kengo Ohta, Department of Information and Computer Sciences, Toyohashi University of Technology, Japan
Masatoshi Tsuchiya, Information and Media Center, Toyohashi University of Technology, Japan
Seiichi Nakagawa, Department of Information and Computer Sciences, Toyohashi University of Technology, Japan
This paper proposes a novel method to construct a spoken language model including fillers from a corpus containing no fillers, using a filler prediction model. The model consists of two sub-models: a filler insertion model, which predicts places where fillers should be inserted, and a filler selection model, which predicts appropriate fillers for given places. It converts a corpus that covers domain-relevant topics but includes no fillers into a corpus that contains fillers as well as domain-relevant topics. Experiments on the Corpus of Spontaneous Japanese show that language models constructed by the proposed method achieve performance quite close to that of a traditional trigram language model constructed from a real spontaneous corpus including fillers.
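The two sub-models compose naturally: one decides where a filler goes, the other decides which filler. The sketch below is a minimal, deterministic caricature with invented probabilities and fillers; the paper's models are trained on real spontaneous speech and would sample or weight rather than threshold.

```python
# Insertion model: P(filler | preceding word), for the slot after each word
# ("<s>" is the sentence-start context). All probabilities are invented.
insertion_prob = {"<s>": 0.6, "this": 0.1, "is": 0.4, "a": 0.05, "pen": 0.05}
# Selection model: P(filler type | slot); here one shared unigram distribution
selection_prob = {"uh": 0.5, "um": 0.3, "well": 0.2}

def insert_fillers(words, threshold=0.5):
    """Convert a filler-free word sequence into one with predicted fillers."""
    best_filler = max(selection_prob, key=selection_prob.get)
    out, prev = [], "<s>"
    for w in words:
        if insertion_prob.get(prev, 0.0) >= threshold:
            out.append(best_filler)   # insertion model fired; selection model picks
        out.append(w)
        prev = w
    return out

augmented = insert_fillers(["this", "is", "a", "pen"])
```

Running the converter over a whole written-style corpus yields filler-bearing text from which an ordinary n-gram language model can then be trained.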
Attention Shift Decoding for Conversational Speech Recognition
Raghunandan Kumaran, University of Washington
Jeff Bilmes, University of Washington
Katrin Kirchhoff, University of Washington
We introduce a novel approach to decoding in speech recognition (termed attention-shift decoding) that attempts to mimic aspects of human speech recognition responsible for robustness in processing conversational speech. Our approach is a radical departure from traditional decoding algorithms for speech recognition. We propose a method to first identify reliable regions of the speech signal and then use these to help decode the unreliable regions, thus conditioning on potentially non-consecutive portions of the signal. We test this approach in a second-pass rescoring framework and compare it to standard second-pass rescoring. On a conversational telephone speech recognition task (EARS RT-03 CTS evaluation), our approach shows an improvement of 2.6% absolute when using oracle information for detecting the reliable regions, and 0.4% absolute when detecting the reliable regions automatically.
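The islands-of-reliability idea above can be illustrated on a three-word toy: fix the high-confidence positions first, then fill the uncertain one by conditioning on both fixed neighbors. All candidates, confidences, and pair scores below are invented; the real system operates on lattices from a conversational telephone speech recognizer.

```python
# Per-position candidate words with acoustic confidences (invented)
candidates = [
    {"the": 0.95},                   # reliable island
    {"cat": 0.40, "hat": 0.35},      # unreliable: decided in the second pass
    {"sat": 0.92},                   # reliable island
]
# Compatibility scores between adjacent words (a stand-in for an LM)
pair_score = {("the", "cat"): 0.9, ("the", "hat"): 0.2,
              ("cat", "sat"): 0.8, ("hat", "sat"): 0.3}

def attention_shift_decode(candidates, reliable_threshold=0.9):
    # Pass 1: fix words whose best confidence clears the threshold
    fixed = {i: max(c, key=c.get) for i, c in enumerate(candidates)
             if max(c.values()) >= reliable_threshold}
    # Pass 2: fill remaining positions, conditioning on the fixed neighbors
    out = dict(fixed)
    for i, c in enumerate(candidates):
        if i in out:
            continue
        def score(w):
            s = c[w]
            if i - 1 in fixed:
                s += pair_score.get((fixed[i - 1], w), 0.0)
            if i + 1 in fixed:
                s += pair_score.get((w, fixed[i + 1]), 0.0)
            return s
        out[i] = max(c, key=score)
    return [out[i] for i in range(len(candidates))]

hyp = attention_shift_decode(candidates)
```

Note how the unreliable position uses context from both sides, including material to its right, which a strict left-to-right decoder would not yet have committed to.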