Interspeech 2007
August 27-31, 2007

Antwerp, Belgium

Interspeech 2007 Session WeD.P1: Resource acquisition and preparation; resource and system evaluation

Type poster
Date Wednesday, August 29, 2007
Time 16:00 – 18:00
Room Foyer
Chair Harald Höge (Siemens, Munich)


JAAE: The Java Abstract Annotation Editor
Miloslav Konopik, University of West Bohemia, FAS, DCSE, Czech Republic
Ivan Habernal, University of West Bohemia, FAS, DCSE, Czech Republic

Recent trends in NLP (Natural Language Processing) are heading towards stochastic processing of natural language. Stochastic methods, however, usually demand large amounts of annotated training data. In most cases, the data must be annotated manually by a team of annotators, which is a highly time-consuming and expensive process. We have therefore developed an efficient, user-friendly editor that helps human annotators create annotated data. The editor, which we offer for free, is described in this article.

How to Judge Reusability of Existing Speech Corpora for Target Task by Utilizing Statistical Multidimensional Scaling
Goshu Nagino, Speech Solutions, New Business Development, Asahi Kasei Corporation
Makoto Shozakai, Speech Solutions, New Business Development, Asahi Kasei Corporation
Kiyohiro Shikano, Graduate School of Information Science, Nara Institute of Science and Technology

In order to develop a target speech recognition system at a lower cost in time and money, the reusability of existing speech corpora is becoming one of the most important issues. This paper proposes a new technique for judging the reusability of existing speech corpora for a target task by utilizing a statistical multidimensional scaling method. In an experiment using twelve tasks in five speech corpora, the proposed method showed a high correlation with cross-task recognition performance and correctly judged the reusability of existing speech corpora for the target task at lower cost.

Feasibility of Constructing an Expressive Speech Corpus from Television Soap Opera Dialogue
Peter Rutten, VRT medialab (Flemish Radio and Television)

This paper presents a study into the feasibility of extracting a corpus of expressive speech from television soap opera dialogue. We investigated how dialogue can be extracted from television production tapes, and what kind of signal quality may be expected. We analysed to what extent the scripts that are used in television production can provide a transcription of the actual dialogue. From the scripts we also estimated how much dialogue speech we can expect to find for each character. We based our analysis on 7 seasons (1145 episodes) of a soap opera produced by the Flemish broadcaster VRT. The results show that processing 100 episodes can result in 3 hours of speech for one of the main characters, or 2.5 hours of dialogue between two of the main characters. The scripts, however, do not provide a quick win for automatic annotation of the corpus - they do not provide sufficiently accurate transcriptions of the dialogue that was actually spoken by the actors.

Collection of empirical data for standardization of generic vocabularies in speech driven ICT devices and services
Rosemary Orr, European Telecommunications Standards Institute ETSI
Françoise Petersen, European Telecommunications Standards Institute ETSI
Helge Hüttenrauch, European Telecommunications Standards Institute ETSI
Martin Böcker, European Telecommunications Standards Institute ETSI
Mike Tate, European Telecommunications Standards Institute ETSI
Bernat Gonzalez i Llinares, University College Utrecht

This paper describes a method of collecting multilingual speech data for use in the compilation of spoken command vocabularies for ICT devices and services in the EU, the EFTA countries, Turkey and Russia. The resulting vocabularies will be published as a European Telecommunications Standards Institute (ETSI) standard, for use by industry in the production of such applications. The project is co-funded by EU/EFTA within the i2010 framework for addressing the main challenges and developments in ICT up to 2010.

Acoustic-Phonetic Features for Refining the Explicit Speech Segmentation
Antônio Marcos Selmini, Universidade Estadual de Campinas (UNICAMP)
Fábio Violaro, Universidade Estadual de Campinas (UNICAMP)

This paper describes the refinement of automatic speech segmentation into phones obtained via Hidden Markov Models (HMMs). The refinement is based on acoustic-phonetic features associated with different phone classes. The proposed system was evaluated on both a small speaker-dependent Brazilian Portuguese speech database and a speaker-independent speech database (TIMIT). The refinement was applied to the boundaries obtained by running the Viterbi algorithm on the HMMs associated with the different utterances. Improvements of 30% and 13% were achieved in the percentage of segmentation errors below 20 ms for the speaker-dependent and speaker-independent databases, respectively.
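For context, the tolerance-based metric used above (percentage of boundary errors below 20 ms) is the standard way segmentation accuracy is reported. A minimal illustrative sketch of that metric (our own function name and numbers, not the authors' code):

```python
def pct_errors_below(errors_ms, tol_ms=20.0):
    """Percentage of boundary errors below a tolerance (e.g. 20 ms).

    errors_ms: absolute deviations, in milliseconds, between automatic
    and manually labeled phone boundaries.
    """
    hits = sum(1 for e in errors_ms if e < tol_ms)
    return 100.0 * hits / len(errors_ms)

# e.g. pct_errors_below([5.0, 15.0, 25.0, 50.0]) -> 50.0
```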

Text island spotting in large speech databases
Benjamin Lecouteux, Laboratoire Informatique d'Avignon (LIA) University of Avignon, France
Georges Linares, Laboratoire Informatique d'Avignon (LIA) University of Avignon, France
Frédéric Beaugendre, Voice-Insight Brussels, Belgium
Pascal Nocera, Laboratoire Informatique d'Avignon (LIA) University of Avignon, France

This paper addresses the problem of using journalist prompts or closed captions to build corpora for training speech recognition systems. Generally, these text documents are imperfect transcripts that lack timestamps. We propose a method combining a driven decoding algorithm and a fast-match process to spot text segments. This method is evaluated both on the French ESTER corpus and on a large database of recordings from the 'Radio Television Belge Francophone' (RTBF) associated with real prompts. Results show very good spotting performance: we observed an F-measure of about 98% on spotting the real text islands in the RTBF corpus. Moreover, decoding driven by the imperfect transcript islands significantly outperforms the baseline system.
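For reference, the F-measure quoted above is the harmonic mean of precision and recall. A minimal sketch of that standard definition (illustrative only, not the authors' implementation):

```python
def f_measure(precision: float, recall: float) -> float:
    """F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# When precision and recall are equal, the F-measure takes that
# same value, e.g. f_measure(0.98, 0.98) -> 0.98.
```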

People Watcher: A Game for Eliciting Human-Transcribed Data for Automated Directory Assistance
Tim Paek, Microsoft Research
Yun-Cheng Ju, Microsoft Research
Christopher Meek, Microsoft Research

Automated Directory Assistance (ADA) allows users to request telephone or address information for residential and business listings using speech recognition. Because callers often express listings differently from how they are registered in the directory, ADA systems require transcriptions of alternative phrasings for directory listings as training data, which can be costly to acquire. As such, a framework in which data can be contributed voluntarily by large numbers of Internet users has tremendous value. In this paper, we introduce People Watcher, a computer game that elicits transcribed, alternative user phrasings for directory listings while at the same time entertaining players. Data generated from the game not only overlapped actual audio transcriptions, but also resulted in a statistically significant 15% relative reduction in semantic error rate when utilized for ADA. Furthermore, semantic accuracy was not statistically different from that obtained using the actual audio transcriptions.

The Effect of Speech Interface Accuracy on Driving Performance
Andrew Kun, University of New Hampshire
Tim Paek, Microsoft Research
Zeljko Medenica, University of New Hampshire

Governments have been enacting legislation prohibiting the use of cell phones during driving without a “hands-free” kit, bringing automotive speech recognition to the forefront of public safety. At the same time, cell phones are becoming smaller with greater computational power, making speech the most viable modality for user input. Given the important role that automotive speech recognition is likely to play in consumer lives, we explore how the accuracy of the speech engine, the use of the push-to-talk button, and the type of dialog repair employed by the interface influence driving performance. In experiments conducted with a driving simulator, we found that the accuracy of the speech engine and its interaction with the use of the push-to-talk button do impact driving performance significantly, but the type of dialog repair employed does not. We discuss the implications of these findings for the design of automotive speech recognition systems.

Context Constrained-Generalized Posterior Probability for Verifying Phone Transcriptions
Hua Zhang, Institute of Automation, Chinese Academy of Sciences
Lijuan Wang, Microsoft Research Asia, Beijing, China
Frank Soong, Microsoft Research Asia, Beijing, China
Wenju Liu, Institute of Automation, Chinese Academy of Sciences

A new statistical confidence measure, Context Constrained-Generalized Posterior Probability (CC-GPP), is proposed for verifying phone transcriptions in speech databases. Unlike generalized posterior probability (GPP), CC-GPP is computed by considering string hypotheses that bear a focused phone with partially matched left and right contexts. Parameters used for CC-GPP include the context window length, a minimal number of matched context phones, and verification thresholds; they are determined by minimizing verification errors on a development set. Evaluated on a test set of 500 sentences containing 2.1% phone errors, CC-GPP achieves 99.6% accuracy and 78.7% recall when 90% of the phones are accepted.

Getting Start with UTDrive: Driver-Behavior Modeling and Assessment of Distraction for In-Vehicle Speech Systems
Pongtep Angkititrakul, CRSS, University of Texas at Dallas
DongGu Kwak, CRSS, University of Texas at Dallas
SangJo Choi, CRSS, University of Texas at Dallas
JeongHee Kim, CRSS, University of Texas at Dallas
Anh PhucPhan, CRSS, University of Texas at Dallas
Amardeep Sathyanarayana, CRSS, University of Texas at Dallas
John H.L. Hansen, CRSS, University of Texas at Dallas

This paper describes our first step towards advances in human-machine interactive systems for in-vehicle environments within the UTDrive project. UTDrive is part of an on-going international collaboration to collect and study rich multi-modal data recorded while the driver is interacting with speech-activated systems or performing other secondary tasks. Simultaneously, another goal is to better understand the speech characteristics of a driver under additional cognitive load (e.g., driving a vehicle). The corpus consists of audio, video, brake/gas pedal pressure, forward distance, GPS information, and CAN-Bus messages. The resulting corpus, analysis, and modeling will contribute to more effective speech systems that can sense driver cognitive distraction/stress and adapt to the driver's cognitive capacity and driving situation for improved safety while driving.

Relative Evaluation of Informativeness in Machine Generated Summaries
BalaKrishna Kolluru, University of Sheffield
Yoshihiko Gotoh, University of Sheffield

This paper is concerned with the relative evaluation of the information content of summaries. We study the effect of crossing summary-question pairs in a comprehension-test-based summary evaluation. Using this scheme, machine-generated and human-authored summaries of broadcast news stories are evaluated. The approach does not use absolute scores; instead, it relies on relative comparison, effectively alleviating the subjectivity of individual summary authors. The evaluation indicates that less than half (44%) of the information is shared between human-authored summaries of roughly 15 words. By comparison, 27% of the information in machine-generated summaries is shared with human-authored summaries.

A Method for Evaluating Task-oriented Spoken Dialog Translation Systems Based on Communication Efficiency
Toshiyuki Takezawa, NiCT/ATR
Masahide Mizushima, NTT Labs.
Tohru Shimizu, NiCT/ATR
Genichiro Kikui, NTT Labs.

We propose a method for measuring communication efficiency from the viewpoint of conveying essential information in task-oriented spoken dialog translation. We present the results of one dialog experiment using speech-to-speech translation systems and a similar experiment using the Wizard of Oz method, which was carried out using hidden interpreters instead of a speech-to-speech translation system. We also present the relative performance score of the speech-to-speech translation system, obtained by comparing the machine's performance with that of humans, i.e., the hidden interpreters. Finally, we discuss the relationship between users' linguistic behavior and system performance. We found that users of the system tended to make shorter utterances, without decreasing the number of essential items needed to achieve the task, and to improve transmission efficiency by using strategies to control the dialog.

Using Eye Movements for Online Evaluation of Speech Synthesis
Charlotte Van Hooijdonk, Department of Communication & Information Sciences, Tilburg University
Edwin Commandeur, Department of Communication & Information Sciences, Tilburg University
Reinier Cozijn, Department of Communication & Information Sciences, Tilburg University
Emiel Krahmer, Department of Communication & Information Sciences, Tilburg University
Erwin Marsi, Department of Communication & Information Sciences, Tilburg University

This paper describes an eye-tracking experiment studying the processing of diphone synthesis, unit selection synthesis, and human speech, taking segmental and suprasegmental speech quality into account. The results showed that both factors influenced the processing of human and synthetic speech, and confirmed that eye tracking is a promising, albeit time-consuming, research method for evaluating synthetic speech.

Sentence Level Intelligibility Evaluation for Mandarin Text-to-Speech Systems Using Semantically Unpredictable Sentences
Jian Li, Research and Development Center, Toshiba (China) Co., LTD.
Dmitry Sityaev, Cambridge Research Laboratory, Toshiba Research Europe Limited
Jie Hao, Research and Development Center, Toshiba (China) Co., LTD.

Intelligibility assessment is one of the important aspects of text-to-speech (TTS) system evaluation. Several intelligibility assessment methods have been proposed and successfully applied to European languages, at both the word level and the sentence level. Since Mandarin has its own unique features, these methods must be modified when applied to Mandarin. Word-level assessment methods such as DRT and MRT have successfully been modified and extended to Mandarin (e.g., CDRT, CDRT-tone and CMRT). Sentence-level assessment methods, on the other hand, have not been well studied for Mandarin. This paper focuses on the Semantically Unpredictable Sentences (SUS) test, one of the most commonly used sentence-level assessment methods, and considers several important aspects of SUS test design when extending it to Mandarin. It also compares the SUS test for Mandarin with CDRT, CDRT-tone and CMRT.

N-best: The Northern- and Southern-Dutch Benchmark Evaluation of Speech recognition Technology
Judith Kessens, TNO Human Factors
David van Leeuwen, TNO Human Factors

In this paper, we describe N-best 2008, the first Large Vocabulary Speech Recognition (LVCSR) benchmark evaluation held for the Dutch language. Both broadcast news and conversational telephone speech recognition will be assessed for two accents (Northern and Southern Dutch). The N-best evaluation will take place in the spring of 2008 and is open to all research institutes and companies on a voluntary basis. The goal of this first N-best evaluation is to define, set up, and conduct a Dutch LVCSR benchmark evaluation. In this paper, we describe the state of the art in Dutch LVCSR, recognition problems that are typical of the Dutch language, and the evaluation protocol.

A MAP based approach to adaptive speech intelligibility measurements
Trym Holter, Acoustic Research Centre, SINTEF ICT
Svein Sorsdal, Acoustic Research Centre, SINTEF ICT

This paper presents an adaptive procedure applied to measurements of speech intelligibility using the modified rhyme test. It is argued that the required speech-to-noise ratio (SNR) can be estimated with sufficient accuracy with as few as 25 utterances. The present procedure is based on the maximum a posteriori (MAP) criterion, and it is demonstrated that the standard deviation of the SNR estimate can be improved by about 0.5 dB compared to a previously published method based on the maximum likelihood procedure.
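As general background on the MAP criterion mentioned above (a toy sketch of MAP estimation of a Gaussian mean under a Gaussian prior, not the paper's measurement procedure; all names and values are our own):

```python
def map_mean(samples, prior_mean, prior_var, noise_var):
    """MAP estimate of a Gaussian mean under a Gaussian prior.

    The estimate shrinks the sample mean toward prior_mean; as
    prior_var grows (an increasingly flat prior), the weight w
    approaches 1 and the estimate approaches the ML estimate.
    """
    n = len(samples)
    ml = sum(samples) / n  # maximum likelihood estimate (sample mean)
    w = (n / noise_var) / (n / noise_var + 1.0 / prior_var)
    return w * ml + (1.0 - w) * prior_mean

# With one sample at 2.0 and equal prior/noise variances, the MAP
# estimate lands halfway between prior mean 0.0 and the sample: 1.0.
```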

Phone Boundary Detection using Selective Refinements and Context-dependent Acoustic Features
Sirinoot Boonsuk, Department of Computer Engineering, Chulalongkorn University
Proadpran Punyabukkana, Department of Computer Engineering, Chulalongkorn University
Atiwong Suchato, Department of Computer Engineering, Chulalongkorn University

Accurate placement of phone boundaries plays an important role in speech recognition systems as well as in high-quality unit selection for speech synthesis. This study proposes a post-processing technique to refine the locations of phone boundaries provided by HMM-based forced alignment. Context-dependent Linear Discriminant Analysis (LDA) classifiers, together with a confidence scoring scheme, are utilized to improve the precision of locating phone boundaries. Not every acoustic feature is suitable for locating boundaries between every type of phonetic segment; therefore, feature selection is performed based on boundary type. The proposed context-dependent refinement results in a 43.9% error reduction in locating phone boundaries compared to those obtained from HMM-based forced alignment. The average deviation from manually labeled boundaries is reduced from 1.4 to 1.0 frames when a 10-millisecond frame size is used.
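The relative error reduction reported above follows the usual definition, (baseline − refined) / baseline. A minimal sketch with illustrative numbers (not figures from the paper):

```python
def relative_error_reduction(baseline_err, refined_err):
    """Relative error reduction, in percent, of a refined system
    over a baseline, given their error rates in percent."""
    return 100.0 * (baseline_err - refined_err) / baseline_err

# e.g. a baseline error rate of 10.0% refined down to 5.61%
# corresponds to a 43.9% relative error reduction.
```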
