Interspeech 2007 Session FrB.P2b: Systems for LVCSR and rich transcription II
Friday, August 31, 2007
10:00 – 12:00
Dilek Hakkani-Tur (ICSI)
Co-training Using Prosodic and Lexical Information for Sentence Segmentation
Umit Guz, ICSI
Sébastien Cuendet, ICSI
Dilek Hakkani-Tür, ICSI
Gokhan Tur, SRI
This paper investigates the application of the co-training learning algorithm to the sentence boundary classification problem, using lexical and prosodic information. Co-training is a semi-supervised machine learning algorithm that starts from multiple weak classifiers and a relatively small amount of labeled data, and incrementally exploits unlabeled data. The assumption in co-training is that the classifiers can train each other, as one can label samples that are difficult for the other. The sentence segmentation problem is well suited to the co-training method since it satisfies the algorithm's main requirement: the dataset can be described by two natural, disjoint views that are redundantly sufficient. In our case, the feature sets capture lexical and prosodic information. Experimental results on the ICSI Meeting (MRDA) corpus show the effectiveness of the co-training algorithm for this task.
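The co-training loop sketched in the abstract can be illustrated as follows. This is a toy sketch, not the authors' implementation: single numbers stand in for lexical and prosodic feature vectors, and a simple nearest-centroid classifier stands in for the paper's real classifiers. All names are illustrative.

```python
# Toy co-training sketch: two "views" (view_a, view_b) of each example,
# one classifier per view, each round adding the most confidently
# auto-labeled example to a shared labeled pool.

class CentroidClassifier:
    """Per-view classifier: nearest class centroid, with the margin
    between the two nearest centroids as a confidence score."""

    def fit(self, xs, ys):
        self.centroids = {}
        for label in set(ys):
            pts = [x for x, y in zip(xs, ys) if y == label]
            self.centroids[label] = sum(pts) / len(pts)
        return self

    def predict(self, x):
        # Return (predicted label, confidence margin).
        scored = sorted((abs(x - c), lab) for lab, c in self.centroids.items())
        confidence = scored[1][0] - scored[0][0] if len(scored) > 1 else 1.0
        return scored[0][1], confidence


def co_train(view_a, view_b, labels, unl_a, unl_b, rounds=3):
    """Each round, each view's classifier labels the unlabeled example it
    is most confident about; that example joins the shared labeled pool
    used to retrain both classifiers."""
    la, lb, ly = list(view_a), list(view_b), list(labels)
    ua, ub = list(unl_a), list(unl_b)
    for _ in range(rounds):
        # Both classifiers are retrained on the pool at the start of a round.
        for clf, feats in ((CentroidClassifier().fit(la, ly), ua),
                           (CentroidClassifier().fit(lb, ly), ub)):
            if not feats:
                break
            preds = [clf.predict(x) for x in feats]
            i = max(range(len(feats)), key=lambda j: preds[j][1])
            # Move the most confident example (both views) into the pool.
            la.append(ua.pop(i)); lb.append(ub.pop(i)); ly.append(preds[i][0])
    return CentroidClassifier().fit(la, ly), CentroidClassifier().fit(lb, ly)
```

The key design point is that the two views are disjoint: each classifier's confident predictions supply training labels the other view could not easily obtain on its own.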
Extracting true speaker identities from transcriptions
Yannick Estève, LIUM - Université du Maine
Sylvain Meignier, LIUM - Université du Maine
Paul Deléglise, LIUM - Université du Maine
Julie Mauclair, LIUM - Université du Maine
Automatic speaker diarization generally produces a generic label such as spkr1 rather than the true identity of the speaker. Recently, two approaches based on lexical rules were proposed to extract the true identity of the speaker from the transcription of the audio recording, without any a priori acoustic information: one uses n-grams, the other semantic classification trees (SCTs). The latter was proposed by the authors of this paper. Here, the two methods are compared in experiments carried out on French broadcast news recordings from the ESTER 2005 evaluation campaign. Experiments are run on both manual and automatic transcriptions; the SCT-based approach gives the best results on automatic transcriptions.
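The lexical-rule idea can be illustrated with a minimal sketch: phrases in broadcast speech often name the previous, current, or next speaker. The patterns below are illustrative stand-ins, far simpler than either the n-gram rules or the SCTs of the paper.

```python
import re

# Illustrative lexical patterns (not the paper's actual rules): each maps
# a trigger phrase to the adjacent speaker turn that the matched name
# should be attached to.
PATTERNS = [
    (re.compile(r"thank you,? (\w+ \w+)", re.I), "previous"),
    (re.compile(r"over to (\w+ \w+)", re.I), "next"),
    (re.compile(r"\bi am (\w+ \w+)", re.I), "current"),
]

def extract_identities(utterance):
    """Return (full name, which turn it names) for each pattern match."""
    hits = []
    for pattern, slot in PATTERNS:
        for m in pattern.finditer(utterance):
            hits.append((m.group(1), slot))
    return hits
```

A real system must then decide which diarization cluster each extracted name belongs to; that assignment step is where the n-gram and SCT approaches differ.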
An Improved Speaker Diarization System
Rong Fu, Department of Computer Science, University of York, York, YO10 5DD, UK
Ian Benest, Department of Computer Science, University of York, York, YO10 5DD, UK
This paper describes an automatic speaker diarization system for natural, multi-speaker meeting conversations. Only one central microphone is used to record the meeting. The new system is robust to different acoustic environments - it requires neither pre-trained models nor development sets to initialize its parameters. The new system determines the model complexity automatically. It adapts the segment model from a universal background model, and uses the cross-likelihood ratio instead of the Bayesian Information Criterion (BIC) for merging. Finally, it uses an intra-cluster/inter-cluster ratio as the stopping criterion. Altogether this reduces the speaker diarization error rate from 21.76% to 17.21% compared with the baseline system.
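The cross-likelihood ratio (CLR) used for merging can be sketched in a simplified form. This toy version models each cluster with a single 1-D Gaussian rather than the adapted GMMs a real diarization system would use, and the symmetric formulation below is one common variant, assumed here rather than taken from the paper.

```python
import math

def gauss_loglik(xs, mu, var):
    """Average log-likelihood of samples under a 1-D Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
               for x in xs) / len(xs)

def fit(xs):
    """Maximum-likelihood mean and (floored) variance of a cluster."""
    mu = sum(xs) / len(xs)
    var = max(sum((x - mu) ** 2 for x in xs) / len(xs), 1e-6)
    return mu, var

def clr(cluster_i, cluster_j):
    """Symmetric cross-likelihood ratio: how much better each cluster's
    data is explained by the other cluster's model than by its own.
    Higher values mean more similar clusters, i.e. better merge candidates."""
    model_i, model_j = fit(cluster_i), fit(cluster_j)
    return (gauss_loglik(cluster_i, *model_j) - gauss_loglik(cluster_i, *model_i)
            + gauss_loglik(cluster_j, *model_i) - gauss_loglik(cluster_j, *model_j))
```

In an agglomerative loop, the pair with the highest CLR is merged repeatedly until the system's intra-cluster/inter-cluster ratio stopping criterion fires.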
The ISL 2007 English Speech Transcription System for European Parliament Speeches
Sebastian Stüker, Institut für Theoretische Informatik, Universität Karlsruhe (TH), Karlsruhe, Germany
Christian Fügen, Institut für Theoretische Informatik, Universität Karlsruhe (TH), Karlsruhe, Germany
Florian Kraft, Institut für Theoretische Informatik, Universität Karlsruhe (TH), Karlsruhe, Germany
Matthias Wölfel, Institut für Theoretische Informatik, Universität Karlsruhe (TH), Karlsruhe, Germany
The project Technology and Corpora for Speech to Speech Translation (TC-STAR) aims at making a breakthrough in speech-to-speech translation research, significantly reducing the gap between the performance of machines and humans at this task. Technological and scientific progress is driven by periodic, competitive evaluations within the project. In this paper we describe the ISL speech transcription system for English European Parliament speeches with which we participated in the third TC-STAR evaluation campaign in the spring of 2007. The improvements over last year's system originate from a segmentation based on recognition hypotheses, the utilization of unsupervised in-domain training material, a modified cross-system adaptation and combination scheme, and the enhancement of the language model through the use of web-based training material.
Advances in Mandarin Broadcast Speech Recognition
Mei-Yuh Hwang, University of Washington
Wen Wang, SRI International
Xin Lei, University of Washington
Jing Zheng, SRI International
Ozgur Cetin, International Computer Science Institute
Gang Peng, University of Washington
We describe our continuing efforts to improve the UW-SRI-ICSI Mandarin broadcast speech recognizer. This includes increasing the acoustic and text training data, adding discriminative features, incorporating a frame-level discriminative training criterion, multiple-pass acoustic model (AM) cross adaptation, language model (LM) genre adaptation, and system combination. The net effect without LM adaptation was a 24%-64% relative reduction in character error rates (CERs) on a variety of test sets. In addition, LM adaptation gave us a further 6% relative CER reduction on broadcast conversations.
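One common mechanism behind LM genre adaptation is linear interpolation of a background model with a genre-specific one, with the weight tuned on held-out text. The sketch below uses unigram probability tables and an illustrative floor value for unseen words; the paper's models are full n-gram LMs, so this is an assumption-laden simplification, not the authors' setup.

```python
import math

FLOOR = 1e-9  # illustrative stand-in for proper smoothing of unseen words

def interp_prob(word, p_bg, p_genre, lam):
    """p(w) = lam * p_genre(w) + (1 - lam) * p_bg(w)."""
    return lam * p_genre.get(word, FLOOR) + (1 - lam) * p_bg.get(word, FLOOR)

def perplexity(words, p_bg, p_genre, lam):
    """Perplexity of held-out text under the interpolated model."""
    log_lik = sum(math.log(interp_prob(w, p_bg, p_genre, lam)) for w in words)
    return math.exp(-log_lik / len(words))

def tune_lambda(heldout, p_bg, p_genre, grid=21):
    """Grid-search the interpolation weight minimizing held-out perplexity."""
    return min((perplexity(heldout, p_bg, p_genre, k / (grid - 1)),
                k / (grid - 1))
               for k in range(grid))[1]
```

The tuned weight lands strictly between 0 and 1 whenever the held-out text contains both genre-typical words and words the genre model has never seen, which is exactly the case interpolation is meant to handle.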
Automatic Transcription for a Web 2.0 Service to Search Podcasts
Jun Ogata, National Institute of Advanced Industrial Science and Technology (AIST)
Masataka Goto, National Institute of Advanced Industrial Science and Technology (AIST)
Kouichirou Eto, National Institute of Advanced Industrial Science and Technology (AIST)
This paper describes speech recognition techniques that enable a Web 2.0 service, "PodCastle", where users can search and read transcribed texts of podcasts, and correct recognition errors in those texts. Most previous speech recognizers had difficulty transcribing podcasts because podcasts include many kinds of content recorded under different conditions and cover recent topics that tend to involve many out-of-vocabulary words. To overcome such difficulties, we continuously improve our speech recognizers by using information aggregated on the basis of Web 2.0. For example, the language model is adapted on the fly to the topic of the target podcast, the pronunciations of out-of-vocabulary words are obtained from a Web 2.0 service, and an acoustic model is trained using the error corrections made by anonymous users. The experiments we report in this paper show that our techniques produce promising results for podcasts.
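On-the-fly topic adaptation of the kind described above typically starts by retrieving text that is similar to the target podcast. The sketch below ranks candidate adaptation texts by TF-IDF cosine similarity to a podcast description; this retrieval step and all names are assumptions for illustration, not PodCastle's actual pipeline.

```python
import math
from collections import Counter

def tfidf(doc_tokens, all_docs):
    """TF-IDF weight vector for one tokenized document."""
    tf = Counter(doc_tokens)
    n = len(all_docs)
    return {w: tf[w] * math.log((1 + n) / (1 + sum(w in d for d in all_docs)))
            for w in tf}

def cosine(a, b):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_adaptation_texts(description, candidates, top_n=2):
    """Return the candidate texts most similar to the podcast description,
    to be fed to LM adaptation (e.g. as interpolation or count data)."""
    docs = [c.lower().split() for c in candidates]
    query = tfidf(description.lower().split(), docs)
    scored = [(cosine(query, tfidf(d, docs)), c)
              for d, c in zip(docs, candidates)]
    return [c for _, c in sorted(scored, reverse=True)[:top_n]]
```

The selected texts would then drive vocabulary expansion and language-model adaptation for that one podcast, which is what lets a single recognizer cope with wildly different topics.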