Interspeech 2007 Session WeB.P1a: Speech and other modalities
Wednesday, August 29, 2007
10:00 – 12:00
Alexandros Potamianos (Technical University of Crete)
Analysis of head motions and speech in spoken dialogue
Carlos Toshinori Ishi, ATR - IRC Labs.
Hiroshi Ishiguro, ATR - IRC Labs.
Norihiro Hagita, ATR - IRC Labs.
With the aim of automatically generating head motions from speech, analyses are conducted to verify the relations between head motions and the linguistic and paralinguistic information carried by speech. The analyses are conducted on motion-captured data from natural dialogue. The results showed that nods occur frequently during speech utterances, not only to express dialog acts such as agreement and affirmation, but also to indicate syntactic or semantic units, appearing on the last syllable of phrases at strong phrase boundaries. The paper also analyzes the dependence of other head motions, such as shakes and tilts, on linguistic, prosodic and voice quality information, and discusses their potential use in the automatic generation of head motions.
A Paradigm for Mobile Speech-Centric Services
Lars Bo Larsen, Department of Electronic Systems, Aalborg University
Kasper Løvborg Jensen, Department of Electronic Systems, Aalborg University
Søren Larsen, Department of Electronic Systems, Aalborg University
Morten Højfeldt Rasmussen, Department of Electronic Systems, Aalborg University
The work presented in this paper describes a new paradigm for speech interaction on mobile devices. A general framework for a distributed architecture is introduced and described. This is followed by a discussion of how to design multimodal interfaces affording spoken input. The solution has been to create an architecture capable of supporting several alternative GUIs, e.g. with spoken input, stylus input or a combination. Speech GUIs are designed entirely without GUI widgets requiring stylus or button input, instead relying on highlighting parts of the text to create emphasis and steer the users’ attention. This is exemplified through the presentation of a prototype car rental application.
Design and Recording of Czech Sign Language Corpus for Automatic Sign Language Recognition
Pavel Campr, University of West Bohemia in Pilsen
Marek Hruz, University of West Bohemia in Pilsen
Milos Zelezny, University of West Bohemia in Pilsen
We describe the design, recording and content of a Czech Sign Language database in this paper. The database is intended for training and testing of sign language recognition (SLR) systems. The UWB-06-SLR-A database contains video data of 15 signers recorded from 3 different views: two of them capture the whole body and provide 3D motion data, while the third is focused on the signer's face and provides data for facial expression feature extraction and for lipreading. The corpus consists of nearly 5 hours of processed and annotated video recorded under laboratory conditions with static illumination. The whole corpus is annotated and pre-processed, ready for use in SLR experiments. It is composed of 25 selected signs from Czech Sign Language, each performed by every signer with 5 repetitions. Altogether the database contains more than 5500 video files, each containing one isolated sign.
Pushy versus meek – using avatars to influence turn-taking
Jens Edlund, Centre for Speech Technology, KTH, Stockholm
Jonas Beskow, Centre for Speech Technology, KTH, Stockholm
The flow of spoken interaction between human interlocutors is a widely studied topic. Amongst other things, studies have shown that we use a number of facial gestures to improve this flow – to control the taking of turns. This ought to be useful in systems where an animated talking head is used, be they systems for computer mediated human-human dialogue or spoken dialogue systems, where the computer itself uses speech to interact with users. In this article, we show that a small set of simple interaction control gestures and a simple model of interaction can be used to influence users’ behaviour in an unobtrusive manner. The results imply that such a model may improve the flow of computer mediated interaction between humans under adverse circumstances, such as network latency, or to create more human-like spoken human-computer interaction.
Wavelet-based Front-End for Electromyographic Speech Recognition
Michael Wand, Universitaet Karlsruhe (TH), Germany
Szu-Chen Stan Jou, Carnegie Mellon University, Pittsburgh, PA, USA
Tanja Schultz, Carnegie Mellon University, Pittsburgh, PA, USA
In this paper we present our investigations into the potential of wavelet-based preprocessing for surface electromyographic speech recognition. We implemented several variants of the Discrete Wavelet Transform and applied them to electromyographic data. First we examined different transforms with various filters and decomposition levels and found that the Redundant Discrete Wavelet Transform performs best among all tested wavelet transforms. Furthermore, we compared the best wavelet transform to our EMG-optimized spectral- and time-domain features. The results showed that the best wavelet transform slightly outperforms the optimized features, with a 30.9% word error rate compared to 32% for the optimized EMG spectral and time-domain features. Both numbers were achieved on a 108-word vocabulary test set using phone-based acoustic models trained on continuously spoken speech captured by EMG.
Intensive Gestures in French and their Multimodal Correlates
Gaëlle Ferré, Laboratoire Parole et Langage
Roxane Bertrand, Laboratoire Parole et Langage
Philippe Blache, Laboratoire Parole et Langage
Robert Espesser, Laboratoire Parole et Langage
Stéphane Rauzy, Laboratoire Parole et Langage
This paper reports a pilot study of intensive gestures in French, i.e. gestures that accompany speech and participate in highlighting certain discourse elements, which the paper seeks to characterize. The study is based on spontaneous, informal French conversation, and the correlates of intensive gestures we examined pertain to the morphological, prosodic and gestural dimensions.
Aspects of Visual Speech in Arabic
Slim Ouni, LORIA - UMR 7503
Kais Ouni, LSTS-ENIT
In this paper, we present a study of visual speech in Arabic. More specifically, we performed a lipreading recognition experiment on Arabic, in which a set of consonant-vowel stimuli was presented as visual-only speech and participants were asked to report what they recognized. The overall lipreading scores were consistent with experiments in other languages. The resulting consonant confusion matrix shows that some phonemes were well discriminated, whereas for others discrimination depended on the context. Results are discussed in terms of phoneme category and vowel context.
Rigid vs Non-Rigid Face and Head Motion in Phone and Tone Perception
Denis Burnham, MARCS Auditory Laboratories, University of Western Sydney, Australia
Jessica Reynolds, MARCS Auditory Laboratories, University of Western Sydney, Australia
Eric Vatikiotis-Bateson, Department of Linguistics, University of British Columbia, Canada
Hani Yehia, Center for Research on Speech, Acoustics, Language and Music, UFMG, Brazil
Valter Ciocca, School of Audiology & Speech Sciences, University of British Columbia, Canada
Rua Haszard Morris, MARCS Auditory Laboratories, University of Western Sydney, Australia
Harold Hill, School of Psychology, University of Wollongong, Australia
Guillaume Vignali, MARCS Auditory Laboratories, University of Western Sydney, Australia
Sandra Bollwerk, MARCS Auditory Laboratories, University of Western Sydney, Australia
Helen Tam, MARCS Auditory Laboratories, University of Western Sydney, Australia
Caroline Jones, School of Education, University of New South Wales, Australia
There is recent evidence that the visual concomitants not only of the articulation of phones (consonants and vowels), but also of tones (fundamental frequency variations that signal lexical meaning in tone languages), facilitate speech perception. Analysis of speech production data from a Cantonese speaker suggested that the source of this perceptual information for tones involves rigid motion of the head rather than non-rigid face motion. A perception study using OPTOTRAK output, in which rigid or non-rigid motion could be presented independently in tone-differing or phone-differing conditions, suggests that non-rigid motion is most useful for the discrimination of phones, whereas rigid motion is most useful for the discrimination of tones.