August 27-31, 2007

Antwerp, Belgium

Interspeech 2007 Session ThB.P2b: Confidence measures (and related topics)

Type: poster
Date: Thursday, August 30, 2007
Time: 10:00 – 12:00
Room: Alpaerts
Chair: Hynek Hermansky (IDIAP Research Institute, Martigny)


Unsupervised re-scoring of observation probability in Viterbi based on reinforcement learning by using confidence measure and HMM neighborhood
Carlos Molina, Universidad de Chile
Nestor Becerra Yoma, Universidad de Chile
Fernando Huenupan, Universidad de Chile
Claudio Garreton, Universidad de Chile

This paper proposes a new paradigm to compensate for mismatched conditions in speech recognition. A two-step Viterbi decoding based on reinforcement learning is proposed. The idea is to strengthen or weaken HMMs by using a Bayes-based confidence measure (BBCM) and distances between models. If HMMs in the N-best list show a low BBCM, the second Viterbi decoding will prioritize the search on neighboring models according to their distances to the N-best HMMs. As shown here, a reduction of 6% in WER is achieved in a task that proves difficult for standard MAP and MLLR adaptation.

Optimization on Decoding Graphs by Discriminative Training
Shiuan-Sung Lin, GET/ENST and CNRS/LTCI, UMR 5141
François Yvon, GET/ENST and CNRS/LTCI, UMR 5141

The three main knowledge sources used in automatic speech recognition (ASR), namely the acoustic models, a dictionary and a language model, are usually designed and optimized in isolation. Our previous work [1] proposed a methodology for jointly tuning these parameters, based on the integration of the resources as a finite-state graph whose transition weights are trained discriminatively. This paper extends the training framework to a large vocabulary task, the automatic transcription of French broadcast news. We propose several fast decoding techniques to make the training practical. Experiments show that an absolute reduction of 1% in word error rate (WER) can be obtained. We conclude the paper with an appraisal of the potential of this approach on large vocabulary ASR tasks.

Morphosyntactic Processing of N-Best Lists for Improved Recognition and Confidence Measure Computation
Stéphane Huet, IRISA
Guillaume Gravier, IRISA
Pascale Sébillot, IRISA

We study the use of morphosyntactic knowledge to process N-best lists. We propose a new score function that combines the part-of-speech (POS), language model, and acoustic scores at the sentence level. Experimental results, obtained for French broadcast news transcription, show a significant improvement in the word error rate with various decoding criteria commonly used in speech recognition. Interestingly, we observed more grammatical transcriptions, which translates into a better sentence error rate. Finally, we show that POS knowledge brings no improvement to classical confidence measures.
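
As an illustration of sentence-level score combination, the following is a minimal sketch of N-best reranking, assuming a weighted sum of log scores. The `Hypothesis` fields, the weights, and the toy numbers are hypothetical; the abstract does not specify the exact form of the score function.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    acoustic_logp: float  # log acoustic score of the whole sentence
    lm_logp: float        # log score from the word language model
    pos_logp: float       # log score from a POS tag sequence model


def combined_score(h, lm_weight=1.0, pos_weight=1.0):
    # Assumed form: weighted sum of the three sentence-level log scores.
    return h.acoustic_logp + lm_weight * h.lm_logp + pos_weight * h.pos_logp


def rerank(nbest, lm_weight=1.0, pos_weight=1.0):
    # Sort the N-best list from best to worst combined score.
    return sorted(
        nbest,
        key=lambda h: combined_score(h, lm_weight, pos_weight),
        reverse=True,
    )


# Toy N-best list (made-up scores): the second hypothesis has a slightly
# worse acoustic and LM score but a much better POS score.
nbest = [
    Hypothesis("hyp one", acoustic_logp=-100.0, lm_logp=-20.0, pos_logp=-15.0),
    Hypothesis("hyp two", acoustic_logp=-101.0, lm_logp=-20.5, pos_logp=-8.0),
]

without_pos = rerank(nbest, pos_weight=0.0)
with_pos = rerank(nbest, pos_weight=1.0)
```

In this toy case the ranking flips once the POS score is included, which is the kind of effect such a combination is meant to produce.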

How Predictable is ASR Confidence in Dialog Applications?
Xiang Li, IBM T. J. Watson Research Center
Juan Huerta, IBM T. J. Watson Research Center

ASR confidence is a metric that reflects, to a large extent, the conditions under which a recognition task is carried out as well as the reliability of the result. Because of this, ASR confidence is a potentially useful feature in frameworks that attempt to assess the state of a dialog. In this paper we evaluate the predictability of ASR confidence based on knowledge of previously observed context-dependent confidences. We find that the contextual confidence can be predicted with a standard prediction deviation of less than 10% of the dynamic range of the confidence score, which represents an almost 40% relative reduction in standard deviation compared with a static-confidence baseline. Because our prediction is based on context, this predictability can be leveraged to estimate the expected average confidence until the end of a call from the context path expected to be traversed.
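
The comparison between a context-based predictor and a static baseline can be sketched as follows. This is only an illustration under simple assumptions: the predictor is the mean of previously observed confidences per context, the baseline is a single global mean, and the context names and values are made up. For brevity the residuals are measured on the same observations used for fitting.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fit_context_means(observations):
    # observations: list of (context, confidence) pairs seen so far.
    by_context = defaultdict(list)
    for context, conf in observations:
        by_context[context].append(conf)
    return {c: mean(values) for c, values in by_context.items()}


def residual_std(observations, predict):
    # Standard deviation of (observed - predicted) confidence.
    return pstdev([conf - predict(context) for context, conf in observations])


# Hypothetical dialog contexts with made-up confidence scores.
observations = [
    ("ask_name", 0.9),
    ("ask_name", 0.8),
    ("ask_date", 0.5),
    ("ask_date", 0.4),
]

context_means = fit_context_means(observations)
global_mean = mean(conf for _, conf in observations)

static_std = residual_std(observations, lambda context: global_mean)
contextual_std = residual_std(observations, lambda context: context_means[context])
```

When confidence depends strongly on the dialog context, as in this toy data, `contextual_std` comes out well below `static_std`.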

Error detection in confusion network
Alexandre Allauzen, LIMSI-CNRS

In this article, error detection for a broadcast news transcription system is addressed in a post-processing stage. We investigate a logistic regression model based on features extracted from confusion networks. This model aims to estimate a confidence score for each confusion set and detect errors. Different kinds of knowledge sources are explored, such as the confusion set alone, a statistical language model, and lexical properties. The impact of the different features is assessed, showing the importance of those extracted from the confusion network alone. To enrich our modeling with information about the neighborhood, features of adjacent confusion sets are also added to the feature vector. Finally, a distinct processing of confusion sets is also explored depending on the value of their best posterior probability. Compared with the standard ASR output, our best system yields a significant improvement of the classification error rate from 17.2% to 12.3%.
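
A logistic regression error detector over confusion-set features can be sketched as below. The two features (best posterior and a scaled count of alternatives), the toy training data, and the plain stochastic gradient descent training loop are assumptions for illustration; the abstract does not list the actual features or the training procedure.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    # Plain stochastic gradient descent on the logistic loss.
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            grad = p - target
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def error_probability(w, b, x):
    # Estimated probability that the best word of a confusion set is wrong.
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical features per confusion set:
# [best posterior probability, number of alternatives / 10].
X = [
    [0.95, 0.1], [0.90, 0.2], [0.85, 0.1],  # correctly recognized words
    [0.40, 0.5], [0.35, 0.6], [0.50, 0.4],  # recognition errors
]
y = [0, 0, 0, 1, 1, 1]  # 1 marks an error

w, b = train_logistic(X, y)
p_confident = error_probability(w, b, [0.92, 0.1])
p_doubtful = error_probability(w, b, [0.38, 0.6])
```

A confidence score for a confusion set can then be taken as one minus the estimated error probability, with an error flagged when that probability exceeds a chosen threshold such as 0.5.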

An Approach to Efficient Generation of High-Accuracy and Compact Error-Corrective Models for Speech Recognition
Takanobu Oba, NTT Corporation
Takaaki Hori, NTT Corporation
Atsushi Nakamura, NTT Corporation

This paper focuses on an error-corrective method based on reranking hypotheses in speech recognition. Some recent work has investigated corrective models that can be used to rescore hypotheses so that a hypothesis with a smaller error rate has a higher score. Discriminative training, such as the perceptron algorithm, can be used to estimate such corrective models. In discriminative training, how to choose competitors is an important factor because the model parameters are estimated from the difference between the reference (or oracle hypothesis) and the competitors. In this paper, we investigate how to choose effective competitors for training corrective models. In particular, we focus on the word error rate (WER) of each hypothesis and show that a higher-WER hypothesis, rather than the best-scored one, works effectively as a competitor. In addition, we show that using only one competitor with the highest WER in an N-best list is very effective for generating accurate and compact corrective models.
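
The competitor selection described above can be sketched as a single perceptron-style update. This is a simplified illustration, not the authors' training procedure: the unigram-count features are hypothetical, the update is applied whenever the competitor differs from the reference, and hypotheses are ranked by word edit count, which orders them the same way as WER for a fixed reference.

```python
from collections import Counter, defaultdict


def word_errors(ref, hyp):
    # Word-level Levenshtein distance (substitutions, insertions, deletions).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def features(hyp):
    # Hypothetical feature map: unigram counts of the hypothesis.
    return Counter(hyp)


def select_competitor(reference, nbest):
    # Pick the single hypothesis with the most word errors in the N-best list.
    return max(nbest, key=lambda h: word_errors(reference, h))


def perceptron_update(weights, reference, nbest, lr=1.0):
    competitor = select_competitor(reference, nbest)
    if competitor == reference:
        return
    # Move weights toward the reference features and away from the competitor.
    for f, v in features(reference).items():
        weights[f] += lr * v
    for f, v in features(competitor).items():
        weights[f] -= lr * v


reference = ["the", "cat", "sat"]
nbest = [
    ["the", "cat", "sat"],
    ["the", "cat", "sad"],
    ["a", "hat", "sad"],
]

weights = defaultdict(float)
perceptron_update(weights, reference, nbest)
```

Only features that distinguish the reference from the chosen competitor receive non-zero weights, which is one plausible reason a single well-chosen competitor can lead to a compact model.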

Detection of Out-of-Vocabulary Words in Posterior Based ASR
Hamed Ketabdar, IDIAP Research Institute, Martigny, Switzerland / Swiss Federal Institute of Technology at Lausanne, Switzerland
Mirko Hannemann, IDIAP Research Institute, Martigny, Switzerland
Hynek Hermansky, IDIAP Research Institute, Martigny, Switzerland / Swiss Federal Institute of Technology at Lausanne, Switzerland

Over the years, sophisticated techniques have evolved for utilizing prior knowledge in the form of text-derived language models and pronunciation lexicons. However, their use has an undesirable effect: unexpected lexical items (words) in the phrase are replaced by acoustically acceptable in-vocabulary items. This is the major source of error since the replacement often introduces additional errors. Improving the machine's ability to handle these unexpected words would considerably increase the utility of speech recognition technology. We propose a new technique for the discovery of unexpected out-of-vocabulary words. It is based on the comparison of two phoneme posterior streams derived from identical acoustic evidence using two different sets of prior constraints, and it does not require any segment boundary decisions.
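
A frame-by-frame comparison of two phoneme posterior streams can be sketched as follows. The use of Kullback-Leibler divergence as the comparison measure, the fixed threshold, and the three-phoneme toy posteriors are assumptions for illustration; the abstract only states that the two streams are compared.

```python
import math


def kl_divergence(p, q, eps=1e-10):
    # KL(p || q) between two phoneme posterior vectors of the same frame;
    # eps guards against log(0) and division by zero.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def divergence_track(stream_a, stream_b):
    # One divergence value per frame; no segment boundaries are needed.
    return [kl_divergence(a, b) for a, b in zip(stream_a, stream_b)]


def flag_frames(stream_a, stream_b, threshold):
    # Mark frames where the two streams disagree strongly.
    return [d > threshold for d in divergence_track(stream_a, stream_b)]


# Toy posteriors over three phonemes for two frames (made-up values).
# stream_weak: posteriors obtained with weak prior constraints.
# stream_strong: posteriors obtained with strong lexical and language model constraints.
stream_weak = [[0.8, 0.1, 0.1], [0.9, 0.05, 0.05]]
stream_strong = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]

track = divergence_track(stream_weak, stream_strong)
flags = flag_frames(stream_weak, stream_strong, threshold=0.5)
```

Frames where the strongly constrained stream departs from the weakly constrained one are candidates for an unexpected word, since the prior constraints are then overriding the acoustic evidence.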
