Interspeech 2007 Session WeB.O2: Prosody production and perception
Wednesday, August 29, 2007
10:00 – 12:00
Marc Swerts (Tilburg University)
Predicting Focus through Prominence Structure
Sasha Calhoun, Centre for Speech Technology Research, University of Edinburgh
Focus is central to our control of information flow in dialogue. Spoken language understanding systems therefore need to be able to detect focus automatically. It is well known that prominence is a key marker of focus in English; however, the relationship is not straightforward. We present focus prediction models built using the NXT Switchboard corpus. We claim that a focus is more likely if a word is more prominent than expected given its syntactic, semantic, and discourse properties. Crucially, the perception of prominence arises not only from acoustic cues, but also from position in prosodic structure. Our focus prediction results, along with a study showing that the acoustic properties of focal accents vary by structural position, support our claims. As focus detection is a largely novel task, these results are an important first step towards detecting focus for spoken language applications.
Analysis of emotional speech prosody in terms of part of speech tags
Murtaza Bulut, University of Southern California
Sungbok Lee, University of Southern California
Shrikanth Narayanan, University of Southern California
A representation of emotions in terms of acoustic features of well-defined lexical elements is desirable for the development of emotional speech processing systems. To that end, this paper investigates the interaction between emotions and part-of-speech (POS) tags. Utterances from 3 speakers in angry, happy, sad, and neutral emotions are used to statistically analyze the effects of emotion, POS tag type, position of the tag, and speaker on tag duration, energy, and F0. It is found that the main effects of emotion, tag type, and position are significant. Results also show that the effect of emotion is significantly dependent on position, but not on POS tag type. The effect of position is noticeable: POS tags located in the first half of sentences have shorter durations, higher energy, and higher F0 values.
The Neutral Tone in Question Intonation in Mandarin
Fang Liu, Department of Linguistics, The University of Chicago, IL, USA
Yi Xu, Department of Phonetics and Linguistics, University College London, UK
This study investigates how the neutral tone, when preceded by different full tones under different focus conditions, behaves in question intonation in Mandarin. Results indicate that 1) the preceding full or neutral tone largely determines the local F0 trajectory of the neutral tone, although the latter also gradually converges over the course of several neutral-tone syllables; 2) post-focus lowering occurs in both neutral-tone-ending and High-tone-ending sentences, with the interrogative intonation in questions realized as an upward shift starting from the focused word; and 3) the sentence-final neutral tone has a falling contour even in questions, thus contrasting with the sentence-final High tone, which has a rising contour in questions.
Pointing to a target while naming it with /pata/ or /tapa/: the effect of consonants and stress position on jaw-finger coordination
Amélie Rochet-Capellan, Gipsa-lab Département parole et cognition (ICP)
Jean-Luc Schwartz, Gipsa-lab Département parole et cognition (ICP)
Rafael Laboissière, Max Planck Institute for Human Cognitive and Brain Science; U864 INSERM
Arthuro Galvàn, Max Planck Institute for Human Cognitive and Brain Science
This study investigates jaw-finger coordination in a task consisting of pointing to a target while naming it with a /pata/ or a /tapa/ utterance stressed either on the first ('CVCV) or on the second (CV'CV) syllable. Optotrak measurements of jaw and finger displacements show that for 'CVCV names, the moment at which the finger reaches the target-alignment position is synchronized with the maximum of the first jaw opening motion. For CV'CV names, the synchronization occurs between the moment at which the finger leaves the target-alignment position and the maximum of the jaw opening motion for the second vowel. This pattern of synchronization depends neither on the target position nor on the order of the consonants. These results add some support to theories involving the coordination of orofacial and brachiomanual gestures in the development and phylogeny of human languages. They call for further investigation of the link between speech and brachiomanual gestures in face-to-face communication.
Suprasegmental aspects of pre-lexical speech in cochlear implanted children
Øydis Hide, CNTS, Department of Linguistics, University of Antwerp, Universiteitsplein 1, Wilrijk B-2610, Belgium
Steven Gillis, CNTS, Department of Linguistics, University of Antwerp, Universiteitsplein 1, Wilrijk B-2610, Belgium
Paul Govaerts, The Eargroup, Antwerp-Deurne, Belgium
This paper investigates suprasegmental features in the pre-lexical speech of congenitally hearing-impaired children who received a Nucleus-24 multichannel cochlear implant between 5 and 20 months of age. Bi-syllabic spontaneous babbling productions were analyzed acoustically at different stages of the babbling period. Fundamental frequency, pitch change in terms of direction and degree, and duration were analyzed for each vowel. The results were compared with those of a control group of normally hearing children, and analyzed with respect to length of cochlear implant experience. Few suprasegmental differences were found between the normally hearing and the cochlear implanted children, indicating that a cochlear implant already provides fundamental improvement in pre-lexical speech. Nevertheless, differences in pitch variation between the two groups at the end of the babbling period may signal a weakness related to the cochlear implant.
Categorical Perception in Intonation: a Matter of Signal Dynamics?
Oliver Niebuhr, Institute of Phonetics and Digital Speech Processing, IPDS, CAU Kiel
Results of recent perception experiments reveal that the signalling of rising-falling F0 peak categories in German intonation involves an interplay of F0 and intensity. Moreover, combining identification judgements and reaction times suggests that the abruptness of the perceptual change between the categories is determined by the signal dynamics, in the sense of the durations of the F0 peak movements and intensity transitions. This undermines the use of categorical perception as an instrument for detecting phonological intonation categories.