August 27-31, 2007

Antwerp, Belgium

Interspeech 2007 Session WeB.O3: Multimodal speech recognition

Type oral
Date Wednesday, August 29, 2007
Time 10:00 – 12:00
Room Marble
Chair Martin Russell (University of Birmingham, UK)

A HMM recognition of consonant-vowel syllables from lip contours: the Cued Speech case
Noureddine Aboutabit, ICP, Département Parole et Cognition de GIPSA-lab
Denis Beautemps, ICP, Département Parole et Cognition de GIPSA-lab
Jeanne Clarke, ICP, Département Parole et Cognition de GIPSA-lab
Laurent Besacier, Laboratoire d’Informatique de Grenoble

Cued Speech (CS) is a manual code that complements lip-reading to enhance speech perception from visual input. The phonetic translation of CS gestures needs to combine the manual CS information with information from the lips, taking into account the desynchronization delay (Attina et al., 2004; Aboutabit et al., 2006) between these two flows of information. This paper focuses on HMM recognition of the lip flow for consonant-vowel (CV) syllables in the French Cued Speech production context. The CV syllables are considered in terms of viseme groups that are compatible with the CS system. The HMM modeling is based on parameters derived from both the inner and outer lip contours. The global recognition rate for CV syllables reaches 80.3%. This study shows that the errors are mainly observed on consonant groups in the context of high and mid-high rounded vowels. In contrast, CV syllables with anterior non-rounded vowels and with low and mid-low rounded vowels are well recognized (87% on average).
A Unified Approach to Multi-Pose Audio-Visual ASR
Patrick Lucey, Queensland University of Technology, Speech, Audio, Image, Video and Technology Laboratory
Gerasimos Potamianos, Human Language Technologies Department, IBM T.J. Watson Research Center
Sridha Sridharan, Queensland University of Technology, Speech, Audio, Image, Video and Technology Laboratory

The vast majority of studies in the field of audio-visual automatic speech recognition (AVASR) assume frontal images of a speaker's face. In contrast, our recent research efforts have concentrated on extracting visual speech information from profile views. The introduction of additional views to an AVASR system increases the complexity of the system, as it has to deal with the different visual features associated with the various views. In this paper, we propose the use of linear regression to find a transformation matrix based on synchronous frontal and profile visual speech data, which is used to normalize the visual speech in each viewpoint into a single uniform view. In experiments on multi-speaker lipreading, we show that this pose-invariant technique reduces the train/test mismatch between visual speech features of different views and is of particular benefit when there is more training data for one viewpoint than for another (e.g. frontal over profile).
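The regression step described in this abstract can be illustrated with a minimal sketch. It assumes (the abstract does not say) that profile features are mapped into the frontal feature space, that the fit is ordinary least squares without a bias term, and that the feature vectors are tiny toy values; the function names are hypothetical and are not taken from the paper.

```python
# Toy sketch: least-squares fit of a matrix W such that profile @ W ~ frontal.
# All names and data are illustrative assumptions, not the authors' system.

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_pose_transform(profile, frontal):
    """Solve the normal equations (P^T P) W = P^T F for W."""
    pt = transpose(profile)
    return solve(matmul(pt, profile), matmul(pt, frontal))

# Synchronous toy frames: each row is one frame's feature vector.
true_w = [[2.0, 0.0], [1.0, -1.0]]
profile = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
frontal = matmul(profile, true_w)

w = fit_pose_transform(profile, frontal)
normalized = matmul(profile, w)  # profile frames mapped into the frontal space
```

On this noise-free toy data the fitted matrix matches the one used to generate the frontal frames; with real features the fit would only be approximate.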
Audio-visual Integration for Robust Speech Recognition Using Maximum Weighted Stream Posteriors
Rowan Seymour, Queen's University Belfast
Darryl Stewart, Queen's University Belfast
Ji Ming, Queen's University Belfast

In this paper, we demonstrate for the first time the robustness of the Maximum Stream Posterior (MSP) method for audio-visual integration on a large speaker-independent speech recognition task in noisy conditions. Furthermore, we show that the method can be generalised and improved by using a softer weighting scheme to account for moderate noise conditions. We call this generalised method the Maximum Weighted Stream Posterior (MWSP) method. In addition, we carry out the first tests of the Posterior Union Model approach for audio-visual integration. All of the methods are compared in digit recognition tests involving various audio and video noise levels and conditions, including tests where both modalities are affected by noise. We also introduce a novel form of noise called jitter, which is used to simulate camera movement. The results verify that the MSP approach is robust and that its generalised form (MWSP) can lead to further improvements in moderate noise conditions.
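The abstract does not give the MSP or MWSP formulas, so the following is only a toy sketch of one plausible reading: stream log-likelihoods are combined with an audio weight, class posteriors are computed under a uniform prior, and each class keeps its largest posterior over a set of candidate weights. Restricting the weights to 0 and 1 stands in for a hard stream choice and a finer grid for the softer scheme; both choices, and all names below, are assumptions made for illustration.

```python
import math

def stream_posteriors(audio_ll, video_ll, audio_weight):
    """Class posteriors from weighted stream log-likelihoods (uniform prior)."""
    scores = {
        c: audio_weight * audio_ll[c] + (1.0 - audio_weight) * video_ll[c]
        for c in audio_ll
    }
    peak = max(scores.values())
    # Log-sum-exp for a numerically stable normaliser.
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores.values()))
    return {c: math.exp(s - log_norm) for c, s in scores.items()}

def max_weighted_posteriors(audio_ll, video_ll, weights):
    """For each class, keep the largest posterior over the candidate weights."""
    best = {c: 0.0 for c in audio_ll}
    for w in weights:
        for c, p in stream_posteriors(audio_ll, video_ll, w).items():
            best[c] = max(best[c], p)
    return best

# Toy frame: the audio stream is informative, the video stream is not.
audio_ll = {"a": -1.0, "b": -3.0}
video_ll = {"a": -2.0, "b": -2.0}

hard = max_weighted_posteriors(audio_ll, video_ll, [0.0, 1.0])
soft = max_weighted_posteriors(audio_ll, video_ll, [i / 10 for i in range(11)])
decision = max(soft, key=soft.get)
```

In this example the uninformative video stream cannot pull the decision away from the class favoured by the audio stream, which is the kind of robustness the abstract attributes to the approach.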
Continuous-Speech Phone Recognition from Ultrasound and Optical Images of the Tongue and Lips
Gérard Chollet, CNRS-LTCI, ENST
Maureen Stone, Vocal Tract Visualization Lab, University of Maryland Dental School

The article describes a video-only speech recognition system for a “silent speech interface” application, using ultrasound and optical images of the voice organ. A one-hour audio-visual speech corpus was phonetically labeled using an automatic speech alignment procedure and robust visual feature extraction techniques. HMM-based stochastic models were estimated separately on the visual and acoustic corpus. The performance of the visual speech recognition system is compared to a traditional acoustic-based recognizer.
Multimodal speech recognition with ultrasonic sensors
Bo Zhu, MIT Computer Science and Artificial Intelligence Laboratory
T.J. Hazen, MIT Computer Science and Artificial Intelligence Laboratory
James Glass, MIT Computer Science and Artificial Intelligence Laboratory

In this research we explore multimodal speech recognition by augmenting acoustic information with that obtained by an ultrasonic emitter and receiver. After designing a hardware component to generate a stereo audio/ultrasound signal, we extract sub-band ultrasonic features that supplement conventional MFCC-based audio measurements. A simple interpolation method is used to combine audio and ultrasound model likelihoods. Experiments performed on a noisy continuous digit recognition task indicate that the addition of ultrasonic information reduces word error rates by 24-29% over a wide range of acoustic SNR (20-0 dB).
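The abstract does not state the form of the interpolation. As a minimal sketch, assuming a linear interpolation of per-hypothesis log-likelihoods with a single weight on the ultrasound model, the combination could look like the following; the function names, the weight value, and the scores are all hypothetical.

```python
def interpolate_log_likelihoods(audio_ll, ultra_ll, ultra_weight):
    """Combine per-hypothesis log-likelihoods from two models.

    ultra_weight = 0 uses audio only; ultra_weight = 1 uses ultrasound only.
    """
    if not 0.0 <= ultra_weight <= 1.0:
        raise ValueError("ultra_weight must lie in [0, 1]")
    return {
        hyp: (1.0 - ultra_weight) * audio_ll[hyp] + ultra_weight * ultra_ll[hyp]
        for hyp in audio_ll
    }

def best_hypothesis(scores):
    """Return the hypothesis with the highest combined score."""
    return max(scores, key=scores.get)

# Toy scores for two competing digit hypotheses in noisy audio.
audio_ll = {"one": -12.0, "nine": -11.5}
ultra_ll = {"one": -8.0, "nine": -10.0}

audio_only = interpolate_log_likelihoods(audio_ll, ultra_ll, 0.0)
combined = interpolate_log_likelihoods(audio_ll, ultra_ll, 0.3)
```

With these toy numbers the audio-only scores slightly favour "nine", while the interpolated scores favour "one", showing how a second stream can change the decision when the audio evidence is weak.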
Fused HMM-Adaptation of Multi-Stream HMMs for Audio-Visual Speech Recognition
David Dean, SAIVT Group, Queensland University of Technology
Patrick Lucey, SAIVT Group, Queensland University of Technology
Sridha Sridharan, SAIVT Group, Queensland University of Technology
Tim Wark, CSIRO ICT Centre & SAIVT Group, QUT

A technique known as fused hidden Markov models (FHMMs) was recently proposed as an alternative multi-stream modelling technique for audio-visual speaker recognition. In this paper we show that for audio-visual speech recognition (AVSR), FHMMs can be adopted as a novel method of training synchronous multi-stream HMMs (MSHMMs). MSHMMs, as proposed by several authors for use in AVSR, are jointly trained on both the audio and visual modalities. In contrast, our proposed FHMM adaptation method can be used to adapt the multi-stream models from single-stream audio HMMs and, in the process, better model the video speech in the final model when compared to jointly-trained MSHMMs. Through experiments conducted on the XM2VTS database, we show that the improved video performance of the FHMM-adapted MSHMMs results in an improvement in AVSR performance over jointly-trained MSHMMs at all levels of audio noise, and provides a significant advantage in high-noise environments.
