Interspeech 2007 Session ThD.P3b: Spoken data retrieval II
Thursday, August 30, 2007
16:00 – 18:00
Murat Saraclar (Bogaziçi University)
A Phonetic Search Approach to the 2006 NIST Spoken Term Detection Evaluation
Roy Wallace, Speech and Audio Research Laboratory, Queensland University of Technology
Robbie Vogt, Speech and Audio Research Laboratory, Queensland University of Technology
Sridha Sridharan, Speech and Audio Research Laboratory, Queensland University of Technology
This paper details the submission from the Speech and Audio Research Lab of Queensland University of Technology (QUT) to the inaugural 2006 NIST Spoken Term Detection Evaluation. The task involved accurately locating the occurrences of a specified list of English terms in a given corpus of broadcast news and conversational telephone speech. The QUT system uses phonetic decoding and Dynamic Match Lattice Spotting to rapidly locate search terms, combined with a neural network-based verification stage. The use of phonetic search means the system is open vocabulary and performs usefully (Actual Term-Weighted Value of 0.23) whilst avoiding the cost of a large vocabulary speech recognition engine.
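For reference, the Actual Term-Weighted Value (ATWV) cited above is the primary metric of the 2006 NIST STD evaluation. Under the evaluation plan's cost and prior settings it takes the following form (a perfect system scores 1.0; a system that outputs nothing scores 0):

```latex
\mathrm{TWV}(\theta) \;=\; 1 \;-\; \frac{1}{|T|}\sum_{t \in T}
\Bigl[\, P_{\mathrm{miss}}(t,\theta) \;+\; \beta \, P_{\mathrm{FA}}(t,\theta) \Bigr],
\qquad
\beta \;=\; \frac{C}{V}\left(\mathrm{Pr}_{\mathrm{term}}^{-1} - 1\right) \approx 999.9
```

ATWV is the TWV obtained at the system's own (actual) decision threshold \(\theta\), as opposed to the best achievable threshold; the large \(\beta\) heavily penalizes false alarms relative to misses.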
Integration of Retrieval Results using Plural Subword Models for Improving Vocabulary-free Spoken Document Retrieval
Yoshiaki Itoh, Iwate Prefectural University
Kohei Iwata, Iwate Prefectural University
Kazunori Kojima, Iwate Prefectural University
Masaaki Ishigame, Iwate Prefectural University
Kazuyo Tanaka, University of Tsukuba
Shi-wook Lee, AIST
Spoken document retrieval (SDR) systems must be vocabulary-free in order to deal with arbitrary query words, because users often search for the section in which a query word is spoken, and query words are liable to be specialized terms not included in a speech recognizer's dictionary. We have previously proposed new subword models, such as the 1/2 phone model, the 1/3 phone model, and the sub-phonetic segment (SPS) model, and have confirmed their effectiveness for SDR. These models offer finer granularity on the time axis than phoneme-based models such as the triphone model. The present paper proposes a method for integrating the plural retrieval results obtained from each subword model and demonstrates the resulting performance improvement through experiments on an actual presentation speech corpus.
The SRI/OGI 2006 Spoken Term Detection System
Dimitra Vergyri, SRI International
Izhak Shafran, OGI School of Science and Engineering
Andreas Stolcke, SRI International
Ramana Rao Gadde, SRI International
Murat Akbacak, SRI International
Brian Roark, OGI School of Science and Engineering
Wen Wang, SRI International
This paper describes the system developed jointly at SRI and OGI for participation in the 2006 NIST Spoken Term Detection (STD) evaluation. We participated in the three genres of the English track: Broadcast News (BN), Conversational Telephone Speech (CTS), and Conference Meetings (MTG). The system consists of two phases. First, audio indexing, an offline phase, converts the input speech waveform into a searchable index. Second, term retrieval, possibly an online phase, returns a ranked list of occurrences for each search term. We used a word-based indexing approach, obtained with SRI's large vocabulary Speech-to-Text (STT) system. Apart from describing the submitted system and its performance on the NIST evaluation metric, we study the trade-offs between performance and system design. We examine performance versus indexing speed, effectiveness of different index ranking schemes on the NIST score, and the utility of approaches to deal with out-of-vocabulary (OOV) terms.
PodCastle: A Web 2.0 Approach to Speech Recognition Research
Masataka Goto, National Institute of Advanced Industrial Science and Technology (AIST)
Jun Ogata, National Institute of Advanced Industrial Science and Technology (AIST)
Kouichirou Eto, National Institute of Advanced Industrial Science and Technology (AIST)
In this paper, we describe a public web service, "PodCastle", that provides full-text searching of Japanese podcasts on the basis of automatic speech recognition. This is an instance of our research approach, "Speech Recognition Research 2.0", which is aimed at providing users with a web service based on Web 2.0 so that they can experience state-of-the-art speech recognition performance, and at promoting speech recognition technologies in cooperation with anonymous users. PodCastle enables users to find podcasts that include a search term, read full texts of their recognition results, and easily correct recognition errors. The results of the error correction can then be used to improve the performance of both full-text search and speech recognition. Although we know of no state-of-the-art speech recognizer that can successfully transcribe all of the various kinds of podcasts, the mechanism we propose will gradually increase the usefulness and applicability of PodCastle.
Speech Mining in Noisy Audio Message Corpus
Nathalie Camelin, LIA, University of Avignon
Frederic Bechet, LIA, University of Avignon
Geraldine Damnati, France Telecom R&D
Renato De Mori, LIA, University of Avignon
Within the framework of automatic analysis of spoken telephone surveys, we propose a robust Speech Mining strategy that selects, from a large database of spoken messages, those likely to be correctly processed by the Automatic Speech Recognition and Classification processes. The problem considered in this paper is the analysis of messages uttered by users of a telephone service in response to a recorded message asking whether a problem they had was satisfactorily solved. Very often in such cases, subjective information is combined with factual information. Since the purpose of this type of analysis is to extract the distribution of users' opinions, it is important to check the representativeness of the subset of messages kept by the rejection strategies. Several measures based on the Kullback-Leibler divergence are proposed to evaluate both the correctness of the extracted information and its representativeness.
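The representativeness check above rests on the Kullback-Leibler divergence between discrete distributions. A minimal sketch of the idea, with hypothetical opinion distributions (the category labels and numbers are illustrative, not from the paper):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) between two discrete
    distributions given as probability lists over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical opinion distributions (e.g. positive / negative / mixed):
full_corpus = [0.55, 0.30, 0.15]   # estimated on all messages
kept_subset = [0.60, 0.28, 0.12]   # estimated on messages kept after rejection

# A small divergence suggests the rejection strategy preserved the
# representativeness of the opinion distribution in the kept subset.
print(round(kl_divergence(kept_subset, full_corpus), 4))
```

A divergence near zero indicates the kept messages still reflect the overall distribution of opinions; a large value would flag a rejection strategy that biases the extracted statistics.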
A Fast Fuzzy Keyword Spotting Algorithm Based on Syllable Confusion Network
Jian Shao, ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences
Qingwei Zhao, ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences
Pengyuan Zhang, ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences
Zhaojie Liu, ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences
Yonghong Yan, ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences
This paper presents a fast fuzzy search algorithm to extract keyword candidates from syllable confusion networks (SCNs) in Mandarin spontaneous speech. Since the recognition accuracy of spontaneous speech is quite poor, a syllable confusion matrix (SCM) is applied to compensate for recognition errors and improve recall. For fast retrieval, an efficient vocabulary-independent index structure is designed that selects individual arcs of the syllable confusion network as indexing units. An inverted search algorithm is proposed that uses the syllable confusion matrix to calculate relevance scores while searching this index structure. In experiments performed on a telephone conversational task, the equal error rate (EER) was reduced by about 33% relative to the baseline, in which keywords are directly extracted from phoneme lattices. Additionally, searching for 100 keywords in one hour of speech data took only one to two seconds.
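The general scheme of an inverted index over confusion-network arcs, with query expansion through a confusion matrix, can be sketched as follows. This is an illustrative toy, not the authors' implementation: the syllables, posteriors, confusion probabilities, and scoring rule are all assumptions standing in for the paper's actual relevance-score calculation.

```python
from collections import defaultdict

# Toy confusion-network arcs: (utterance_id, position, syllable, posterior).
# Individual arcs serve as the indexing units, as in the paper.
arcs = [
    ("utt1", 0, "bei", 0.9), ("utt1", 1, "jing", 0.8),
    ("utt2", 0, "pei", 0.7), ("utt2", 1, "jing", 0.6),
]

# Inverted index: syllable -> list of (utterance, position, posterior).
index = defaultdict(list)
for utt, pos, syl, post in arcs:
    index[syl].append((utt, pos, post))

# Toy confusion matrix P(recognized | spoken): expands a query syllable to
# its acoustically confusable variants, enabling fuzzy matching.
confusion = {"bei": {"bei": 0.85, "pei": 0.10}, "jing": {"jing": 0.95}}

def search(query_syllables):
    """Score each utterance for a keyword: for every query syllable, look up
    its confusable variants in the index and accumulate posterior * P(conf)."""
    scores = defaultdict(float)
    for syl in query_syllables:
        for variant, p_conf in confusion.get(syl, {syl: 1.0}).items():
            for utt, pos, post in index[variant]:
                scores[utt] += post * p_conf
    return dict(scores)

print(search(["bei", "jing"]))  # "utt2" still matches via the pei/bei confusion
```

Because only index lookups (not a rescan of the lattices) are needed per query, this style of search scales to the second-level retrieval times reported in the abstract.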
Advances in SpeechFind: Transcript Reliability Estimation Employing Confidence Measure based on Discriminative Sub-word Model for SDR
Wooil Kim, Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, Texas, USA
John H. L. Hansen, Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, Texas, USA
This study presents recent advances in our spoken document retrieval (SDR) system, SpeechFind, including our partnership with the Collaborative Digitization Program (CDP). A prototype of SpeechFind for the CDP currently serves as the search engine for 1,300 hours of CDP audio content. In this paper, a reliability estimation method for ASR-generated transcripts is proposed to provide more effective retrieval information for SpeechFind. The proposed estimator is based on Bayesian classification employing several confidence measures. We also propose a novel confidence measure for reliability estimation that employs acoustically discriminative sub-word models. Experimental results on CDP material demonstrate that the proposed confidence measure is effective in improving the reliability estimator: relative improvements of 10.5% in accuracy and 20.9% in critical error were obtained.
An Interactive Timeline for Speech Database Browsing
Benoit Favre, LIA, University of Avignon
Jean-François Bonastre, LIA, University of Avignon
Patrice Bellot, LIA, University of Avignon
Speech databases lack efficient interfaces for exploring information along the time axis. We introduce an interactive timeline that helps the user browse an audio stream on a large time scale and recontextualize targeted information. Time can be explored at different granularities using synchronized scales. We attempt to take advantage of automatic transcription to generate a conceptual structure of the database. The timeline is annotated with two elements that reflect the distribution of information relevant to a user's need. Information density is computed using an information retrieval model and displayed as a continuous shade on the timeline, whereas anchorage points are expected to provide a stronger structure and guide the user through the exploration. These points are generated using an extractive summarization algorithm. We present a prototype implementing the interactive timeline to browse broadcast news recordings.