Interspeech 2007 Session WeD.O2: Discourse, dialogue and emotion expression
Wednesday, August 29, 2007
16:00 – 18:00
Johanna Moore (University of Edinburgh)
Integrating Audio and Visual Cues for Speaker Friendliness in Multimodal Speech Synthesis
David House, Dept. of Speech, Music and Hearing, CSC, KTH, Stockholm, Sweden
This paper investigates interactions between audio and visual cues to friendliness in questions in two perception experiments. In the first experiment, manually edited parametric audio-visual synthesis was used to create the stimuli. Results were consistent with earlier findings in that a late, high final focal accent peak was perceived as friendlier than an earlier, lower focal accent peak. Friendliness was also effectively signaled by visual facial parameters such as a smile, head nod and eyebrow raising synchronized with the final accent. Consistent additive effects were found between the audio and visual cues, both for the subjects as a group and individually, showing that subjects integrate the two modalities. The second experiment used data-driven visual synthesis where the database was recorded by an actor instructed to portray anger and happiness. Friendliness was correlated with the happy database, but the effect was not as strong as for the parametric synthesis.
The influence of masking words on the prediction of TRPs in a shadowed dialog
Wieneke Wesseling, IFA/ACLC, University of Amsterdam, The Netherlands
R.J.J.H. Van Son, IFA/ACLC, University of Amsterdam, The Netherlands
Louis C.W. Pols, IFA/ACLC, University of Amsterdam, The Netherlands
It is well known that listeners can ignore disturbances in speech and rely on context to interpolate the message. This fact is used to determine the importance of individual words for projecting Transition Relevance Places, TRPs. Subjects were asked to shadow manipulated pre-recorded dialogs with minimal responses, saying "ah" when they felt it was appropriate. In these dialogs, for each utterance, either one of the last four words was replaced at random by white noise (masked condition) or no word was replaced (non-masked condition). The reaction times were analyzed for effects of masked words. The presence of masked words, even prominent words, did not affect the response times of our subjects unless the very last word of the utterance was masked. This indicates that listeners are able to seamlessly interpolate the missing words and only need the identity of the last word to determine the exact position of the TRP.
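The masking manipulation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes word boundaries are available as sample spans, and the function name, span format, and noise scaling are all hypothetical choices made here for clarity.

```python
import random

import numpy as np

def mask_utterance(samples, word_spans, p_mask=0.5, seed=None):
    """Replace one of the last four words of an utterance with white noise.

    samples:    1-D float array of audio samples.
    word_spans: list of (start, end) sample indices, one per word.
    With probability p_mask, one of the last four words is replaced by
    white noise (masked condition); otherwise the utterance is returned
    unchanged (non-masked condition).
    Returns (audio, index of the masked word or None).
    """
    rng = random.Random(seed)
    out = samples.copy()
    if rng.random() >= p_mask or not word_spans:
        return out, None                      # non-masked condition
    candidates = word_spans[-4:]              # last four words
    idx = rng.randrange(len(candidates))
    start, end = candidates[idx]
    # scale the noise to the RMS level of the word it replaces
    rms = float(np.sqrt(np.mean(out[start:end] ** 2))) or 1e-3
    out[start:end] = np.random.default_rng(seed).normal(0.0, rms, end - start)
    return out, len(word_spans) - len(candidates) + idx
```

Masking at the word level rather than at fixed time offsets keeps the manipulation aligned with the linguistic units whose contribution to TRP projection is being tested.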
Analysis of the Occurrence of Laughter in Meetings
Kornel Laskowski, Universitaet Karlsruhe
Susanne Burger, Carnegie Mellon University
Automatic speech understanding in natural multiparty conversation settings stands to gain from parsing not only verbal but also non-verbal vocal communicative behaviors. In this work, we study the most frequently annotated non-verbal behavior, laughter, whose detection has clear implications for speech understanding tasks, and for the automatic recognition of affect in particular. To complement existing acoustic descriptions of the phenomenon, we explore the temporal patterning of laughter over the course of conversation, with a view towards its automatic segmentation and detection. We demonstrate that participants vary extensively in their use of laughter, and that laughter differs from speech in its duration and in the regularity of its occurrence. We also show that laughter and speech are quite dissimilar in terms of the degree of simultaneous vocalization by multiple participants, and the probability of transitioning into and out of vocalization overlap states.
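The transition probabilities into and out of vocalization overlap states mentioned above can be estimated from per-frame activity labels. The following is a minimal sketch, not the authors' code: the function name, the three-state labelling (silence, one speaker, overlap), and the array layout are assumptions made here for illustration.

```python
import numpy as np

def overlap_transition_probs(activity):
    """Estimate transition probabilities between vocalization states.

    activity: (n_frames, n_participants) boolean array, True where a
    participant vocalizes in that frame.  Each frame is labelled
    0 = silence, 1 = one speaker, 2 = overlap (two or more speakers),
    and a 3x3 row-stochastic matrix of frame-to-frame transition
    probabilities is returned.
    """
    n_active = activity.sum(axis=1)
    states = np.minimum(n_active, 2)          # clamp counts to {0, 1, 2}
    counts = np.zeros((3, 3))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # normalize rows; leave never-visited states as all-zero rows
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)
```

Comparing such matrices computed over laugh frames versus speech frames is one way to quantify the dissimilarity in overlap behavior that the abstract reports.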
Incremental perception of acted and real emotional speech
Pashiera Barkhuysen, University of Tilburg
Emiel Krahmer, University of Tilburg
Marc Swerts, University of Tilburg
This paper reports on an experiment using the gating paradigm to test the recognition speed for various emotional expressions from a speaker’s face. In a perception experiment, subjects were presented with video clips of speakers who displayed negative or positive emotions, which were either acted or real. The clips were shown in successive segments (gates) of increasing duration. Results show that subjects are surprisingly accurate in their recognition of the various emotions, as they already reach high recognition scores in the first gate (after only 160 milliseconds). Interestingly, the recognition speed is faster for positive than negative emotions, in line with comparable valence effects reported by Leppänen and Hietanen (2003). Finally, the gating results confirm earlier findings that acted emotions are perceived as more intense than real emotions (Wilting et al., 2006), as the former get more extreme recognition scores than the latter, already after a short period of exposure.
Speaking through a noisy channel - Experiments on inducing clarification behaviour in human-human dialogue
David Schlangen, University of Potsdam, Germany
Raquel Fernández, University of Potsdam, Germany
We report results of an experiment on inducing communication problems in human-human dialogue. We set up a voice-only cooperative task where we manipulated one channel by replacing (in real time, at random points) all signal with noise. Altogether around 10% of the speaker's signal was thus removed. We found an increase in clarification requests of a form that has previously been hypothesized to be used mainly for clarifying acoustic problems. We also found a correlation between the percentage of an utterance being manipulated and the use of devices for pointing out error locations. From our findings, we derive a gold-standard policy for clarification behaviour that could be of use for designers of spoken dialogue systems.
Computerized chironomy: evaluation of hand-controlled intonation reiteration
Christophe d'Alessandro, LIMSI-CNRS
Albert Rilliard, LIMSI-CNRS
Sylvain Le Beux, LIMSI-CNRS
This paper addresses the question of intonation modeling in terms of hand movements (chironomy). An experiment in hand-controlled intonation reiteration is described. A system for real-time intonation modification driven by a graphic tablet is presented. This system is used for reiterating a speech corpus (sentences of 1 to 9 syllables, natural and reiterant speech). The subjects also produced vocal imitations of the same corpus. Correlation and distances between natural and reiterated intonation contours are measured. These measures show that chironomic reiteration and vocal reiteration give comparable, and good, results. This paves the way for several applications in expressive intonation synthesis and for a new intonation modeling paradigm in terms of movements.
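Correlation and distance between a natural and a reiterated intonation contour can be computed as sketched below. This is a generic illustration, not the paper's measurement procedure: the time-normalization by linear resampling and the function name are assumptions made here.

```python
import numpy as np

def contour_similarity(f0_ref, f0_test, n_points=100):
    """Compare two intonation (F0) contours of possibly different lengths.

    Both contours are linearly resampled onto a common normalized time
    axis, then the Pearson correlation and the root-mean-square distance
    between them are computed.  Returns (correlation, rms_distance).
    """
    t = np.linspace(0.0, 1.0, n_points)
    ref = np.interp(t, np.linspace(0.0, 1.0, len(f0_ref)), f0_ref)
    test = np.interp(t, np.linspace(0.0, 1.0, len(f0_test)), f0_test)
    r = np.corrcoef(ref, test)[0, 1]
    rms = np.sqrt(np.mean((ref - test) ** 2))
    return float(r), float(rms)
```

Correlation captures similarity of contour shape independently of level, while the RMS distance penalizes absolute F0 differences, so the two measures are complementary.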