Interspeech 2007 Session ThD.O1: Systems for LVCSR and rich transcription I
Thursday, August 30, 2007
16:00 – 18:00
Thomas Hain (University of Sheffield)
The RWTH 2007 TC-STAR Evaluation System for European English and Spanish
Jonas Lööf, Lehrstuhl für Informatik 6, Computer Science Dept., RWTH Aachen University, Aachen, Germany
Christian Gollan, Lehrstuhl für Informatik 6, Computer Science Dept., RWTH Aachen University, Aachen, Germany
Stefan Hahn, Lehrstuhl für Informatik 6, Computer Science Dept., RWTH Aachen University, Aachen, Germany
Georg Heigold, Lehrstuhl für Informatik 6, Computer Science Dept., RWTH Aachen University, Aachen, Germany
In this work, the RWTH automatic speech recognition systems developed for the third TC-STAR evaluation campaign (2007) are presented. The RWTH systems make systematic use of internal system combination, combining systems that differ in feature extraction, adaptation methods, and training data. To take advantage of this, novel feature extraction methods were employed; this year saw the introduction of Gammatone features and MLP-based phone posterior features. Further improvements were achieved using unsupervised training, and it is notable that these gains were obtained with a fairly small amount of automatically transcribed data. Also contributing to the improvements over last year were the switch to MPE training and the introduction of projecting SAT transforms.
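As background on one of the front ends mentioned above, a Gammatone filter is conventionally defined by the impulse response g(t) = t^(n-1) e^(-2πbt) cos(2πf_c t), with its bandwidth b tied to the ERB scale. The sketch below generates a single filter's impulse response using the standard Glasberg–Moore ERB formula; the order, duration, and normalization are illustrative textbook choices, not the parameters of the RWTH front end.

```python
import numpy as np

def gammatone_ir(center_hz, fs, order=4, duration=0.025):
    """Impulse response of a single gammatone filter,
    g(t) = t**(order-1) * exp(-2*pi*b*t) * cos(2*pi*f_c*t),
    with bandwidth b tied to the ERB scale (Glasberg & Moore).
    Textbook form with illustrative defaults, not RWTH's exact setup."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * center_hz / 1000.0 + 1.0)  # equivalent rectangular bandwidth
    b = 1.019 * erb                                 # gammatone bandwidth parameter
    g = (t ** (order - 1)
         * np.exp(-2 * np.pi * b * t)
         * np.cos(2 * np.pi * center_hz * t))
    return g / np.max(np.abs(g))                    # peak-normalize

# A filterbank is a set of such filters at ERB-spaced center frequencies;
# Gammatone features are typically compressed filterbank outputs.
filterbank = [gammatone_ir(f, 16000) for f in (200.0, 500.0, 1000.0, 2000.0)]
```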
Using Direction of Arrival Estimate and Acoustic Feature Information in Speaker Diarization
Eugene Chin Wei Koh, School of Computer Engineering, Nanyang Technological University (NTU), Singapore 639798
Hanwu Sun, Speech and Dialogue Processing Lab, Institute for Infocomm Research (I2R), Singapore 119613
Tin Lay Nwe, Speech and Dialogue Processing Lab, Institute for Infocomm Research (I2R), Singapore 119613
Trung Hieu Nguyen, School of Computer Engineering, Nanyang Technological University (NTU), Singapore 639798
Bin Ma, Speech and Dialogue Processing Lab, Institute for Infocomm Research (I2R), Singapore 119613
Eng-Siong Chng, School of Computer Engineering, Nanyang Technological University (NTU), Singapore 639798
Haizhou Li, Speech and Dialogue Processing Lab, Institute for Infocomm Research (I2R), Singapore 119613
Susanto Rahardja, Speech and Dialogue Processing Lab, Institute for Infocomm Research (I2R), Singapore 119613
This paper describes the I2R/NTU system submitted for the NIST Rich Transcription 2007 (RT-07) Meeting Recognition evaluation Multiple Distant Microphone (MDM) task. In our implementation, the Direction of Arrival (DOA) information is used specifically to perform speaker turn detection and clustering. Cluster purification is then carried out by performing GMM modeling on acoustic features. Finally, non-speech and silence removal is applied to discard unwanted segments. The system achieved an overall DER of 31.02% on the NIST Rich Transcription Spring 2006 evaluation tasks.
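The abstract does not detail how DOA is estimated; a common approach with multiple distant microphones is to estimate the time difference of arrival (TDOA) between a microphone pair via GCC-PHAT and map it to an angle. The sketch below implements plain GCC-PHAT TDOA estimation (a standard technique, not necessarily the one used in the I2R/NTU system):

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=None):
    """Time difference of arrival between two microphone signals via
    GCC-PHAT (generalized cross-correlation with phase transform).
    A positive result means `sig` lags `ref`."""
    n = len(sig) + len(ref)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12                 # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:                # optionally bound the search by mic spacing
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (int(np.argmax(np.abs(cc))) - max_shift) / fs
```

With a known microphone spacing d and sound speed c ≈ 343 m/s, the arrival angle follows as arcsin(clip(tau·c/d, -1, 1)).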
Recovering Punctuation Marks for Automatic Speech Recognition
Fernando Batista, L2F INESC-ID / ISCTE
Diamantino Caseiro, L2F INESC-ID/ IST
Nuno Mamede, L2F INESC-ID/ IST
Isabel Trancoso, L2F INESC-ID/ IST
This paper presents results on recovering punctuation marks in speech transcriptions for a Portuguese broadcast news corpus. The approach is based on maximum entropy models and uses word, part-of-speech, time and speaker information. The contribution of each type of feature is analyzed individually. Separate results are given for each focus condition, making it possible to analyze the performance differences between planned and spontaneous speech.
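As an illustration of the modeling choice, a maximum entropy model here amounts to multinomial logistic regression over sparse features of each word boundary. The sketch below trains such a classifier with hand-rolled gradient ascent; the feature names (word identity, pause length, speaker change) echo the feature types listed in the abstract, but the exact templates, labels, and data are hypothetical, not the authors'.

```python
import numpy as np

LABELS = ["none", "comma", "period"]  # punctuation decision at a word boundary

def featurize(word, pause_sec, speaker_change):
    # Illustrative features only; not the paper's actual templates.
    return {"word=" + word.lower(): 1.0,
            "long_pause": 1.0 if pause_sec > 0.5 else 0.0,
            "spk_change": 1.0 if speaker_change else 0.0,
            "bias": 1.0}

def train_maxent(examples, lr=0.5, epochs=200):
    """Multinomial logistic regression (maximum entropy) trained by
    batch gradient ascent on the log-likelihood."""
    index = {}
    for feats, _ in examples:
        for f in feats:
            index.setdefault(f, len(index))
    X = np.zeros((len(examples), len(index)))
    y = np.zeros(len(examples), dtype=int)
    for i, (feats, label) in enumerate(examples):
        for f, v in feats.items():
            X[i, index[f]] = v
        y[i] = LABELS.index(label)
    W = np.zeros((len(index), len(LABELS)))
    onehot = np.eye(len(LABELS))[y]
    for _ in range(epochs):
        z = X @ W
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)             # softmax probabilities
        W += lr * X.T @ (onehot - p) / len(examples)  # log-likelihood gradient
    return index, W

def predict(index, W, feats):
    x = np.zeros(len(index))
    for f, v in feats.items():
        if f in index:                                # unseen features are dropped
            x[index[f]] = v
    return LABELS[int(np.argmax(x @ W))]
```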
Disfluency Correction of Spontaneous Speech using Conditional Random Fields with Variable-Length Features
Jui-Feng Yeh, Department of Computer Science and Information Engineering, Far East University
Chung-Hsien Wu, Department of Computer Science and Information Engineering, National Cheng Kung University
Wei-Yen Wu, Department of Computer Science and Information Engineering, National Cheng Kung University
This paper presents an approach to detecting and correcting edit disfluencies based on conditional random fields (CRFs) with variable-length features. The variable-length features consist of word, chunk and sentence features. CRFs are adopted to model the properties of edit disfluencies, including repairs, repetitions and restarts, for edit disfluency detection. The proposed method is evaluated on the Mandarin conversational dialogue corpus (MCDC). The detection error rate for edit words is 17.3%. Compared with DF-gram, maximum entropy, and an approach combining a language model and an alignment model, the proposed approach achieves improvements of 11.7%, 8% and 3.9%, respectively. The experimental results show that the proposed model outperforms the other methods and efficiently detects and corrects edit disfluencies in spontaneous speech.
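At decode time, a linear-chain CRF labels each word (e.g., as fluent or as part of an edit region) with a Viterbi search over per-position and transition scores. The sketch below shows only that generic decoding step, with hand-set illustrative scores mimicking a repetition disfluency; the paper's variable-length features and trained potentials are not reproduced here.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Best label sequence under a linear-chain model.
    emissions: (T, K) log-domain per-position label scores;
    transitions: (K, K) log-domain score of label i -> label j."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t]  # (prev, cur)
        back[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0)
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Labels: 0 = fluent, 1 = inside an edit region. Hand-set scores mimic
# the repetition "I want I want to go": the first "I want" is the edit.
em = np.log(np.array([[.3, .7], [.3, .7], [.8, .2],
                      [.8, .2], [.9, .1], [.9, .1]]))
tr = np.log(np.array([[.7, .3], [.4, .6]]))
```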
Detection, Diarization, and Transcription of Far-Field Lecture Speech
Jing Huang, IBM T.J. Watson Research Center
Etienne Marcheret, IBM T.J. Watson Research Center
Karthik Visweswariah, IBM T.J. Watson Research Center
Vit Libal, IBM T.J. Watson Research Center
Gerasimos Potamianos, IBM T.J. Watson Research Center
Speech processing of lectures in smart rooms has been central to Rich Transcription (RT) Meeting Recognition Evaluation, sponsored by NIST, with emphasis placed on benchmarking speech activity detection (SAD), speaker diarization (SPKR), speech-to-text (STT), and speaker-attributed STT (SASTT) technologies. We present the IBM systems developed to address these tasks in preparation for the RT07s evaluation, focusing on the far-field condition of lecture data collected as part of EU project CHIL. The systems are benchmarked on a subset of the RT06s evaluation test set, where they yield significant improvements for all SAD, SPKR, and STT tasks over RT06s results; for example, a 16% relative reduction in word error rate is reported in STT, attributed to a number of system advances discussed here. Initial results are also presented on SASTT, a task newly introduced in 2007 in place of the discontinued SAD.
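Of the four tasks, SAD is the simplest to illustrate: decide, frame by frame, whether speech is present. The sketch below is a classic fixed-threshold energy detector, given only as a baseline illustration of the task definition; it is not the IBM system's detector.

```python
import numpy as np

def energy_sad(signal, fs, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Frame-level energy speech activity detector: a frame counts as
    speech when its log energy is within `threshold_db` of the loudest
    frame. A fixed-threshold baseline for illustration only."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    energy = np.array([np.mean(signal[i * hop:i * hop + frame] ** 2)
                       for i in range(n_frames)])
    db = 10.0 * np.log10(energy + 1e-12)     # floor avoids log(0) on silence
    return db > db.max() + threshold_db      # boolean speech/non-speech flags
```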
Speech-Based Annotation and Retrieval of Digital Photographs
Timothy J. Hazen, Massachusetts Institute of Technology
Brennan Sherry, Massachusetts Institute of Technology
Mark Adler, Nokia Research Center
In this paper we describe the development of a speech-based annotation and retrieval system for digital photographs. The system uses a client/server architecture which allows photographs to be captured and annotated on lightweight clients, such as mobile camera phones, and then processed, indexed and stored on networked servers. For speech-based retrieval we have developed a mixed grammar recognition approach that allows the speech recognition system to construct a single finite-state network combining context-free grammars, which recognize and parse query carrier phrases and metadata phrases, with an unconstrained statistical n-gram model that recognizes free-form search terms. Experiments demonstrating successful retrieval of photographs using purely speech-based annotation and retrieval are presented.
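The core idea of the mixed grammar, constrained carrier phrases wrapped around an unconstrained free-form slot, can be illustrated at the text level with a few patterns. In the sketch below, the fixed alternations stand in for the context-free carrier-phrase grammars and the open slots stand in for the n-gram free-form terms; the phrases themselves are hypothetical, not the system's actual grammar:

```python
import re

# Hypothetical carrier-phrase patterns: fixed alternations play the role
# of the context-free grammars; the named slots play the role of the
# unconstrained n-gram free-form search terms.
CARRIER_PATTERNS = [
    re.compile(r"^show me (?:photos|pictures) of (?P<terms>.+?)"
               r"(?: from (?P<when>.+))?$"),
    re.compile(r"^find (?:photos|pictures) (?:of|with) (?P<terms>.+)$"),
]

def parse_query(text):
    """Split a recognized query into free-form search terms and an
    optional metadata phrase; return None if no carrier phrase fits."""
    for pat in CARRIER_PATTERNS:
        m = pat.match(text.lower())
        if m:
            return {"terms": m.group("terms"),
                    "when": m.groupdict().get("when")}
    return None
```

A real implementation compiles the carrier-phrase constraints and the n-gram model into one finite-state network scored during decoding; this text-level version only mirrors the resulting parse.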