Interspeech 2007
August 27-31, 2007

Antwerp, Belgium

Interspeech 2007 Session FrC.P1a: Spoken language understanding and summarization


Type poster
Date Friday, August 31, 2007
Time 13:30 – 15:30
Room Foyer
Chair Gokhan Tur (SRI)

FrC.P1a‑1

A Comparative Study on Speech Summarization of Broadcast News and Lecture Speech
Jian Zhang, Human Language Technology Center, HKUST
Ho Yin Chan, Human Language Technology Center, HKUST
Pascale Fung, Human Language Technology Center, HKUST
Lu Cao, Human Language Technology Center, HKUST

We carry out a comprehensive study of acoustic/prosodic, linguistic and structural features for speech summarization, contrasting two genres of speech, namely Broadcast News and Lecture Speech. We find that acoustic and structural features are more important for Broadcast News summarization due to the speaking styles of anchors and reporters, as well as typical news story flow. Owing to the relatively small contribution of lexical features, Broadcast News summarization does not depend heavily on ASR accuracy. We use an SVM-based summarizer to select the best features for extractive summarization and obtain state-of-the-art performance: a ROUGE-L F-measure of 0.64 for Mandarin Broadcast News and 0.65 for Mandarin Lecture Speech. In the case of Lecture Speech summarization, where lexical features are more important, we make the surprising discovery that summarization performance remains very high (0.63 ROUGE-L F-measure) even when ASR accuracy is low (21% CER).
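ROUGE-L, the metric reported above, scores a candidate summary by the longest common subsequence (LCS) it shares with a reference summary, combined into an F-measure. A minimal sketch of that computation (function names are illustrative, not from the paper):

```python
def lcs_len(ref, hyp):
    # Dynamic-programming longest common subsequence length.
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l_f(ref_tokens, hyp_tokens, beta=1.0):
    # ROUGE-L: LCS-based recall and precision, combined into an F-measure.
    lcs = lcs_len(ref_tokens, hyp_tokens)
    if lcs == 0:
        return 0.0
    r = lcs / len(ref_tokens)        # recall against the reference
    p = lcs / len(hyp_tokens)        # precision of the hypothesis
    return (1 + beta**2) * p * r / (r + beta**2 * p)
```

With `beta=1.0` this reduces to the familiar harmonic mean of precision and recall.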
FrC.P1a‑2

Towards Online Speech Summarization
Gabriel Murray, CSTR, University of Edinburgh
Steve Renals, CSTR, University of Edinburgh

The majority of speech summarization research has focused on extracting the most informative dialogue acts from recorded, archived data. However, a potential use case for speech summarization in the meetings domain is to facilitate a meeting in progress by providing the participants - whether they are attending in-person or remotely - with an indication of the most important parts of the discussion so far. This requires being able to determine whether a dialogue act is extract-worthy before the global meeting context is available. This paper introduces a novel method for weighting dialogue acts using only very limited local context, and shows that high summary precision is possible even when information about the meeting as a whole is lacking. A new evaluation framework consisting of weighted precision, recall and f-score is detailed, and the novel online summarization method is shown to significantly increase recall and f-score compared with a method using no contextual information.
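The weighted precision/recall/f-score framework mentioned above can be sketched generically: each dialogue act carries a weight (for instance, the fraction of annotators who marked it extract-worthy), precision averages the weight of what the summarizer extracted, and recall measures the share of total reference weight captured. The weighting scheme here is a hypothetical stand-in, not the paper's exact formulation:

```python
def weighted_prf(weights, extracted, reference):
    # weights: dialogue-act id -> annotator weight (hypothetical scheme, e.g.
    # the fraction of annotators who marked the act as extract-worthy).
    # Weighted precision: mean weight of the acts the summarizer extracted.
    p = sum(weights.get(da, 0.0) for da in extracted) / len(extracted)
    # Weighted recall: weight captured by the summary over total reference weight.
    total = sum(weights[da] for da in reference)
    r = sum(weights.get(da, 0.0) for da in extracted if da in reference) / total
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Under this scheme, extracting acts that only some annotators selected is partially rewarded rather than scored as a hard hit or miss.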
FrC.P1a‑3

System Request Detection in Conversation Based on Acoustic and Speaker Alternation Features
Tomoyuki Yamagata, Kobe university
Atsushi Sako, Kobe university
Tetsuya Takiguchi, Kobe university
Yasuo Ariki, Kobe university

For a hands-free speech interface, it is important to detect commands within spontaneous utterances. To discriminate commands from human-human conversation using acoustic features, it is effective to focus on the head and the tail of an utterance, since the characteristics of system requests and spontaneous utterances differ most in these parts. Experiments show that treating the head and the tail of an utterance separately improves detection accuracy, and that modeling speaker alternation with two-channel microphones improves performance further. Although detecting system requests with linguistic features alone already achieves high accuracy, combining acoustic and turn-taking features lifts performance higher still.
FrC.P1a‑4

Selecting On-Topic Sentences from Natural Language Corpora
Michael Levit, BBN Technologies
Elizabeth Boschee, BBN Technologies
Marjorie Freedman, BBN Technologies

We describe a system that examines input sentences with respect to arbitrary topics formulated as natural language expressions. It extracts predicate-argument structures from text intervals and links them into semantically organized proposition trees. By instantiating trees constructed for topic descriptions in trees representing input sentences or parts thereof, we are able to assess degree of "topicality" for each sentence. The presented strategy was used in the BBN distillation system for the GALE Year 1 evaluation and achieved outstanding results compared to other systems and human participants.
FrC.P1a‑5

A Semi-supervised Method for Efficient Construction of Statistical Spoken Language Understanding Resources
Seokhwan Kim, Pohang University of Science and Technology
Minwoo Jeong, Pohang University of Science and Technology
Gary Geunbae Lee, Pohang University of Science and Technology

We present a semi-supervised framework for constructing spoken language understanding resources at very low cost. We generate context patterns from a few seed entities and a large amount of unlabeled utterances. Using these context patterns, we extract new entities from the unlabeled utterances. The extracted entities are appended to the seed entities, and by repeating these steps we obtain an extended entity list. Our method is based on an utterance alignment algorithm, a variant of biological sequence alignment. Using this method, we can obtain precise entity lists with high coverage, which helps reduce the cost of building resources for statistical spoken language understanding systems.
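The bootstrap loop described above can be sketched as follows. This is a simplified illustration: plain left/right word contexts stand in for the paper's alignment-derived patterns, and all names are hypothetical:

```python
def bootstrap_entities(utterances, seeds, rounds=3):
    # Seed entities induce context patterns (left word, right word); the
    # patterns then extract new entities from unlabeled utterances, and the
    # loop repeats until no new entities are found.
    entities = set(seeds)
    for _ in range(rounds):
        patterns = set()
        for utt in utterances:
            words = utt.split()
            for i, w in enumerate(words):
                if w in entities and 0 < i < len(words) - 1:
                    patterns.add((words[i - 1], words[i + 1]))
        new = set()
        for utt in utterances:
            words = utt.split()
            for i in range(1, len(words) - 1):
                if (words[i - 1], words[i + 1]) in patterns:
                    new.add(words[i])
        if new <= entities:  # converged: nothing new extracted
            break
        entities |= new
    return entities
```

For example, seeding with one music genre in utterances like "play jazz please" lets the shared context ("play", "please") pull in the other genres.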
FrC.P1a‑6

Automatic Extraction of Cue Phrases for Important Sentences in Lecture Speech and Automatic Lecture Speech Summarization
Yasuhisa Fujii, Toyohashi University of Technology
Norihide Kitaoka, Nagoya University
Seiichi Nakagawa, Toyohashi University of Technology

We automatically extract summaries of spoken class lectures. This paper presents a novel method for sentence extraction-based automatic speech summarization. We propose a technique that extracts "cue phrases for important sentences (CPs)", which often appear in important sentences. We formulate CP extraction as a word-sequence labeling problem and use Conditional Random Fields (CRF) for labeling. Automatic summarization using the CP extraction results as features yields precisions of 0.603 and 0.556 on manual transcriptions and Automatic Speech Recognition (ASR) results, respectively. Combining the CP-derived features with traditional features (surface linguistic information such as repeated words, words repeated in slide text, and term frequency (tf), together with prosodic features such as speech power and duration), we obtained better summarization performance: a kappa-value of 0.380, an F-measure of 0.539, and a ROUGE-4 of 0.709.
FrC.P1a‑7

A Unified Probabilistic Generative Framework for Extractive Spoken Document Summarization
Yi-Ting Chen, National Taiwan Normal University and Institute of Information Science, Academia Sinica, Taiwan
Hsuan-Sheng Chiu, National Taiwan Normal University, Taiwan
Hsin-Min Wang, Institute of Information Science, Academia Sinica, Taiwan
Berlin Chen, National Taiwan Normal University, Taiwan

In this paper, we consider extractive summarization of Chinese broadcast news speech. We propose a unified probabilistic generative framework that combines the sentence generative probability and the sentence prior probability for sentence ranking. Each sentence of a spoken document to be summarized is treated as a probabilistic generative model for predicting the document. Two different matching strategies, literal term matching and concept matching, are extensively investigated: we explore the hidden Markov model (HMM) and the relevance model (RM) for literal term matching, and the word topical mixture model (WTMM) for concept matching. In addition, confidence scores, structural features, and a set of prosodic features are incorporated via the whole sentence maximum entropy model (WSME) to estimate the sentence prior probability. Experiments performed on Chinese broadcast news collected in Taiwan yielded very promising initial results.
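The ranking idea above, scoring each sentence S by its prior times the probability that S generates the document D, can be sketched with a unigram mixture in the HMM style (the smoothing weight `lam` and all names are illustrative assumptions, not the paper's exact models):

```python
import math
from collections import Counter

def sentence_score(sentence, document, bg_counts, lam=0.6, log_prior=0.0):
    # Ranks sentence S by log P(S) + log P(D | S), where P(w | S) mixes the
    # sentence's own unigram model with a background corpus model.
    s_counts = Counter(sentence)
    s_len = len(sentence)
    bg_total = sum(bg_counts.values())
    score = log_prior
    for w in document:
        p_s = s_counts[w] / s_len                 # sentence model
        p_bg = bg_counts.get(w, 0) / bg_total     # background smoothing
        p = lam * p_s + (1 - lam) * p_bg
        score += math.log(p) if p > 0 else math.log(1e-12)
    return score
```

Sentences sharing more terms with the document score higher; the extractive summary keeps the top-ranked sentences.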
FrC.P1a‑8

Generic class-based statistical language models for robust speech understanding in directed dialog applications
Matthieu Hebert, Nuance Communications

We investigate the use of class-based statistical language models (SLMs) for robust speech understanding. Generic class-based SLMs are built using data from several applications and then tested on data from a distinct target application to benchmark their portability. The results show that these generic class-based SLMs perform as well as those trained on data from the target testing application. This leads us to conclude that, for directed dialog applications, words that do not fall within a rule (class) are generic across applications. Moreover, the generic class-based SLMs can be used to automatically transcribe utterances from the target application with high accuracy. These transcriptions are then used to train a word-based SLM; the resulting word-based SLM outperforms the class-based ones.
FrC.P1a‑9

Robust Location Understanding in Spoken Dialog Systems Using Intersections
Michael Seltzer, Microsoft Research
Yun-Cheng Ju, Microsoft Research
Ivan Tashev, Microsoft Research
Alex Acero, Microsoft Research

The availability of digital maps and mapping software has led to significant growth in location-based software and services. To safely use these applications in mobile and automotive scenarios, users must be able to input precise locations using speech. In this paper, we propose a novel method for location understanding based on spoken intersections. The proposed approach utilizes a rich, automatically-generated grammar for street names that maps all street name variations into a single canonical semantic representation. This representation is then transformed to a sequence of position-dependent subword units. This sequence is used by a classifier based on the vector space model to reliably recognize spoken intersections in the presence of recognition errors and incomplete street names. The efficacy of the proposed approach is demonstrated using data collected from users of a deployed spoken dialog system.
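The vector-space matching step above can be sketched with cosine similarity over character n-grams as a stand-in for the paper's position-dependent subword units (the n-gram choice and function names are assumptions for illustration):

```python
import math
from collections import Counter

def char_ngrams(s, n=3):
    # Character n-grams with boundary markers, as a bag of features.
    s = f"#{s}#"
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(a, b):
    num = sum(a[k] * b.get(k, 0) for k in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def match_intersection(hypothesis, canonical_list, n=3):
    # Pick the canonical intersection closest to the (possibly misrecognized)
    # ASR hypothesis in the n-gram vector space.
    h = char_ngrams(hypothesis, n)
    return max(canonical_list, key=lambda c: cosine(h, char_ngrams(c, n)))
```

Because similarity is computed over subword units rather than whole words, a recognition error like "mane" for "main" still maps to the right intersection.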
