Interspeech 2007 Session ThB.P3b: Lexical and prosodic modelling
Thursday, August 30, 2007
10:00 – 12:00
Lori Lamel (CNRS-LIMSI/TLP)
Use of syllable center detection in duration models for improved Chinese Mandarin connected digits recognition in cars
Sergey Astrov, Corporate Technology, Siemens AG
Joachim Hofer, Corporate Technology, Siemens AG
Harald Höge, Corporate Technology, Siemens AG
The paper describes practical approaches for improving Mandarin digit recognition accuracy in cars. We consider syllable and subword unit durations as an additional source of information. The approach operates in two stages. First, the system performs standard speech recognition using acoustic spectral features, generating an n-best list of hypotheses. In the second stage, the hypothesis probabilities are re-estimated using duration models, reordering the hypotheses so that the correct ones are pushed to the top of the n-best list and the word error rate (WER) is reduced. We explore duration n-grams: when durations are normalized to a relative speech rate to eliminate the influence of speech-rate variations, a 10% relative WER reduction is achieved. A novel approach, in which durations are normalized to a syllable rate obtained from a syllable center detector, yields a 13.3% relative WER reduction.
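The two-stage idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the tuple layout, and the linear combination of acoustic and duration scores are all assumptions.

```python
def rescore_nbest(hypotheses, duration_model, weight=0.2):
    """Re-rank an n-best list by combining each hypothesis's first-pass
    score with a duration log-probability.

    hypotheses: list of (score, words, durations) tuples, where
    durations are per-syllable frame counts from the first pass.
    duration_model(word, normalized_duration) -> log-probability.
    """
    rescored = []
    for score, words, durations in hypotheses:
        # Normalize durations by the utterance's average syllable length,
        # so fast and slow speakers share the same duration statistics.
        rate = sum(durations) / len(durations)
        dur_logprob = sum(
            duration_model(w, d / rate) for w, d in zip(words, durations)
        )
        rescored.append((score + weight * dur_logprob, words))
    # Hypotheses with plausible durations move toward the top of the list.
    return sorted(rescored, key=lambda x: x[0], reverse=True)
```

A hypothesis whose syllable durations fit the model better can overtake one with a marginally higher acoustic score, which is how the re-ranking reduces WER.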
Using phonetic features in unsupervised word decompounding for ASR with application to a less-represented language
Thomas Pellegrini, LIMSI-CNRS
Lori Lamel, LIMSI-CNRS
In this paper, a data-driven word decompounding algorithm is described and applied to a broadcast news corpus in Amharic. The baseline algorithm has been enhanced to address the increased phonetic confusability arising from word decompounding, by incorporating phonetic properties and constraints on recognition units derived from prior forced alignment experiments. Speech recognition experiments have been carried out to validate the approach. Out-of-vocabulary (OOV) word rates can be reduced by 30% to 40%, and an absolute Word Error Rate (WER) reduction of 0.4% has been achieved. The algorithm is relatively language independent and requires minimal adaptation to be applied to other languages.
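The core of a data-driven decompounder of this kind can be illustrated with a frequency-based splitter. This sketch is hypothetical (the abstract does not give the actual algorithm): it splits a word only when both halves are frequent standalone units, and a minimum-length constraint stands in for the phonetic constraints used to limit confusable fragments.

```python
def decompound(word, vocab_counts, min_len=3, min_count=5):
    """Split a word into two parts if both halves are frequent
    standalone units in the corpus. The minimum-length constraint
    guards against producing short, phonetically confusable fragments.
    """
    best = None
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if vocab_counts.get(left, 0) >= min_count and \
           vocab_counts.get(right, 0) >= min_count:
            # Prefer the split whose parts are most frequent.
            score = vocab_counts[left] + vocab_counts[right]
            if best is None or score > best[0]:
                best = (score, [left, right])
    return best[1] if best else [word]
```

Replacing rare compounds with frequent sub-units in the lexicon is what shrinks the OOV rate: unseen compounds become recognizable as sequences of known parts.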
Robust F0 Modeling for Mandarin Speech Recognition in Noise
Sheng Qiang, College of Computer Science and Technology, Zhejiang University, China
Yao Qian, Microsoft Research Asia, Beijing, China
Frank K. Soong, Microsoft Research Asia, Beijing, China
Congfu Xu, College of Computer Science and Technology, Zhejiang University, China
F0 contour plays an important role in recognizing spoken tonal languages like Mandarin Chinese. However, the discontinuity of F0 at voiced/unvoiced transitions has traditionally been a bottleneck in creating a succinct statistical tone model for automatic speech recognition applications. Recently, we applied Multi-Space Distribution (MSD) to Mandarin tone modeling and reported a relative 24% reduction of tonal syllable errors. In this paper, we test MSD further in a noisy, continuous Mandarin digit recognition task, where eight noises are added to clean speech at five SNRs. The experimental results show that our MSD-based digit models can significantly improve the recognition performance in noise over a baseline system. Relative digit error rate reductions of 19.1% and 15.0% are obtained for noises seen and unseen in the training data, respectively. The improvements are also better than other reference systems where F0 information is incorporated.
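The way an MSD handles the voiced/unvoiced discontinuity can be shown with a two-space likelihood: a zero-dimensional "unvoiced" space for frames with no F0, and a continuous Gaussian over log-F0 for voiced frames. This is a minimal sketch of the general MSD idea, not the paper's model; the parameterization is assumed.

```python
import math

def msd_loglik(f0, voiced_weight, mean, var):
    """Log-likelihood of an F0 observation under a two-space MSD:
    a discrete 'unvoiced' space (f0 is None) with weight
    1 - voiced_weight, and a Gaussian over log-F0 with weight
    voiced_weight. No F0 interpolation through unvoiced regions
    is needed, which is the point of the MSD formulation.
    """
    if f0 is None:
        # Unvoiced frame: probability mass of the zero-dimensional space.
        return math.log(1.0 - voiced_weight)
    # Voiced frame: Gaussian density in the log-F0 domain.
    x = math.log(f0)
    return (math.log(voiced_weight)
            - 0.5 * math.log(2 * math.pi * var)
            - (x - mean) ** 2 / (2 * var))
```

Because unvoiced frames contribute a well-defined probability instead of a fabricated F0 value, the same state-level model scores both frame types consistently.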
Word Duration Modeling for Word Graph Rescoring in LVCSR
Dino Seppi, FBK-irst, formerly ITC-irst, Trento, Italy
Daniele Falavigna, FBK-irst, formerly ITC-irst, Trento, Italy
Georg Stemmer, Siemens AG, Corporate Technology, Munich, Germany
Roberto Gretter, FBK-irst, formerly ITC-irst, Trento, Italy
A well-known unfavorable property of HMMs in speech recognition is their inappropriate representation of phone and word durations. This paper describes an approach to resolve this limitation by integrating explicit word duration models into an HMM-based speech recognizer. Word durations are represented by log-normal densities using a back-off strategy that approximates the durations of seldom-observed words by combining the statistics of suitable sub-word units. Furthermore, two different normalization procedures are compared which reduce the influence of the implicit HMM duration distribution resulting from the state-to-state transition probabilities. Experiments on European parliamentary speeches in English and Spanish show that the proposed approaches are effective and lead to small but consistent reductions in the word error rate for large-vocabulary speech recognition tasks.
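The log-normal model with sub-word back-off can be sketched as follows. All names are illustrative, and combining sub-unit log-normals by adding means and variances is a simplifying assumption of this sketch, not necessarily the paper's exact back-off.

```python
import math

def lognormal_logpdf(d, mu, sigma):
    """Log-density of duration d (in frames) under a log-normal model."""
    x = math.log(d)
    return (-math.log(d * sigma * math.sqrt(2 * math.pi))
            - (x - mu) ** 2 / (2 * sigma ** 2))

def word_duration_logprob(word, d, word_stats, phone_stats, lexicon,
                          min_count=10):
    """Score duration d for a word. Words observed often enough use
    their own log-normal statistics; otherwise we back off to the
    word's phones, approximating the sum of per-phone durations by a
    single log-normal with summed means and variances.

    word_stats:  word  -> (mu, sigma, observation_count)
    phone_stats: phone -> (mu, sigma)
    lexicon:     word  -> list of phones
    """
    if word in word_stats and word_stats[word][2] >= min_count:
        mu, sigma, _ = word_stats[word]
        return lognormal_logpdf(d, mu, sigma)
    # Back-off: combine sub-word (phone) statistics.
    phones = lexicon[word]
    mu = sum(phone_stats[p][0] for p in phones)
    var = sum(phone_stats[p][1] ** 2 for p in phones)
    return lognormal_logpdf(d, mu, math.sqrt(var))
```

The back-off matters because most lexicon entries occur too rarely in training data to estimate a reliable word-level density directly.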
On Automatic Prominence Detection for German
Fabio Tamburini, DSLO - University of Bologna - ITALY
Petra Wagner, IFK - University of Bonn - Germany
Perceptual prominence is an important indicator of a word’s and syllable’s lexical, syntactic, semantic and pragmatic status in a discourse. Its automatic annotation would be a valuable enrichment of large databases used in unit selection speech synthesis and speech recognition. Previous approaches to German relied on linguistic features in prominence detection, but a purely acoustic method would be advantageous. We applied an algorithm to German data that had been previously used for English and Italian. Both the algorithm and the data annotation encode prominence as a continuous rather than a categorical parameter. First results are encouraging, but again show that prominence perception relies on linguistic expectancies as well as acoustic patterns. Also, our results further strengthen the view that force accents are a more reliable cue to prominence than pitch accents in German.
Prosody-enriched lattices for improved syllable recognition
Sankaranarayanan Ananthakrishnan, University of Southern California
Shrikanth Narayanan, University of Southern California
Automatic recognition of syllables is useful for many spoken language applications such as speech recognition and spoken document retrieval. Short-term spectral properties (such as mel-frequency cepstral coefficients, or MFCCs) are usually the features of choice for such systems, which typically ignore suprasegmental (prosodic) cues that manifest themselves at the syllable, word and utterance level. Previous work has shown that categorical representations of prosody correlate well with lexical entities. In this paper, we attempt to exploit this relationship by enriching syllable-level lattices, generated by a standard speech recognizer, with categorical prosodic events for improved syllable recognition performance. With the enriched lattices, we obtain a 2% relative improvement in syllable error rate over the baseline system on a read speech task (the Boston University Radio News Corpus).
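Enriching a lattice with categorical prosodic events amounts to adding a weighted prosodic log-probability to each arc score before rescoring. The sketch below is an assumed, simplified arc representation; the actual lattice format and event inventory are not specified in the abstract.

```python
def enrich_lattice(arcs, prosody_logprob, weight=0.1):
    """Add a prosodic-event log-probability to each syllable arc.

    arcs: list of dicts with 'syllable', 'event' (a categorical
    prosodic label, e.g. a pitch-accent type), and 'score'.
    prosody_logprob(syllable, event) -> log-probability that the
    event co-occurs with the syllable.
    """
    for arc in arcs:
        # Arcs whose prosodic context fits the syllable gain score,
        # shifting the best path in the rescored lattice.
        arc["score"] += weight * prosody_logprob(arc["syllable"],
                                                 arc["event"])
    return arcs
```

A best-path search over the enriched lattice then yields the prosody-informed syllable hypothesis.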
Exploiting Phoneme Similarities in Hybrid HMM-ANN Keyword Spotting
Joel Pinto, IDIAP Research Institute
Andrew Lovitt, University of Illinois at Urbana-Champaign
Hynek Hermansky, IDIAP Research Institute
We propose a technique for generating alternative models for keywords in a hybrid hidden Markov model - artificial neural network (HMM-ANN) keyword spotting paradigm. Given a base pronunciation for a keyword from the lookup dictionary, our algorithm generates a new model for the keyword that takes into account the systematic errors made by the neural network while avoiding models that can be confused with other words in the language. The new keyword model improves the keyword detection rate while minimally increasing the number of false alarms.
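One way to realize this idea is to substitute phones the classifier systematically confuses, then filter out variants that collide with other vocabulary entries. This is a hypothetical single-substitution sketch, not the authors' algorithm; the threshold and data structures are assumptions.

```python
def alternative_pronunciations(base_phones, confusions, vocab_prons,
                               min_prob=0.2):
    """Generate alternative keyword models from a phone confusion
    matrix, discarding variants that match another word's
    pronunciation.

    confusions:  phone -> list of (confusable_phone, probability)
    vocab_prons: set of phone tuples for all other vocabulary words.
    """
    variants = []
    for i, p in enumerate(base_phones):
        for q, prob in confusions.get(p, []):
            if prob < min_prob:
                continue  # ignore rare, unsystematic confusions
            variant = tuple(base_phones[:i] + [q] + base_phones[i + 1:])
            if variant not in vocab_prons:  # avoid confusable models
                variants.append(list(variant))
    return variants
```

Spotting with the base model plus these variants catches tokens the network systematically mis-transcribes, which is what raises the detection rate without a matching rise in false alarms.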
Online Vocabulary Adaptation using Limited Adaptation Data
Chang E Liu, Microsoft Research Asia
Kishan Thambiratnam, Microsoft Research Asia
Frank Seide, Microsoft Research Asia
This paper presents a study of low-latency domain-independent online vocabulary adaptation using limited amounts of supporting text data. The target applications include blind indexing of Internet content, indexing of new content with low latency, and domains where Out-Of-Vocabulary (OOV) words are problematic. A number of methods to perform document-specific adaptation using a small amount of support metadata and the Internet are examined. It is shown that a combination of word feature fusion and cross-file statistics pooling provides robust adaptation. The best evaluated method achieved an absolute reduction of 27.6% in OOV detection false alarm rate over the baseline word feature thresholding methods.