Interspeech 2007 Session FrB.P3b: Prosody: perception

Type poster
Date Friday, August 31, 2007
Time 10:00 – 12:00
Room Keurvels
Chair Julia Hirschberg (Department of Computer Science, Columbia University)


Dependence of tone perception on syllable perception
Michael Olsberg, University College London
Yi Xu, University College London
Jeremy Green, Orbis Investment Advisory Limited

We tested the hypothesis that, given the consistency of tone-syllable alignment found in recent research, accuracy of tone perception is dependent on the accuracy of syllable perception. In two experiments, subjects either judged the number of syllables or identified the tones in nonsense sentences that were spectrally intact, low-pass filtered at 300 Hz, or converted to sustained schwa carrying the original F0. It was found that removing spectral information affected not only subjects’ ability to judge the number of syllables in a sentence, but also their ability to identify the tones. The results thus confirm the dependence of tone perception on syllable perception.

Testing the Relevance of Speech Rate, Pitch and a Glottal Chink for the Perception of Age in Synthesized Speech using Formant Synthesis
Ralf Winkler, Technical University Berlin

Listeners are able to rate a speaker's age with reasonable accuracy. However, it is still controversial which features reliably signal a speaker's age. This paper presents results of a synthesis study in which speech rate, pitch, and a glottal chink were varied systematically over a range that effectively occurs in natural speech to shift the mean perceived age. The strongest impact on age judgements was found for (i) speech rate, followed by (ii) the glottal chink, while the impact of pitch was only marginal. Some interactions (iii) between the parameters were observed as well. Results regarding (i) and (ii) show that formant synthesis is capable of producing speech that varies considerably in its mean perceived age even if only a small number of features are manipulated. Regarding (iii), results indicate that interactions between selected features should also be considered when studying their impact.

Utterance-Final Glottalization as a Cue for Familiar Speaker Recognition
Tamás Bőhm, Department of Telecommunications and Media Informatics, BME, Budapest, Hungary
Stefanie Shattuck-Hufnagel, Research Laboratory of Electronics, MIT, Cambridge, MA USA

Several studies have reported systematic differences across speakers in the rate and type of intermittent irregular vocal fold vibration (glottalization). Still, it remains an open question whether human listeners use this speaker-specific information as a cue for recognizing familiar voices. A perceptual experiment was conducted to investigate this issue, concentrating on irregularity in utterance-final position. A novel method was employed to manipulate the final voice quality (in our case, modal or glottalized). Listeners who were familiar with the voices of the speakers were presented with pairs of speech samples: one with the original and another with manipulated final voice quality. When listeners were asked to select the member of the pair that was closer to the talker’s voice, they chose the unmanipulated token in 63% of the trials. This result suggests that irregular pitch periods in utterance-final regions play a role in the recognition of individual speaker voices.

A Rule-Based Speech Morphing for Verifying an Expressive Speech Perception Model
Chun-Fang Huang, Japan Advanced Institute of Science and Technology
Masato Akagi, Japan Advanced Institute of Science and Technology

This paper describes a rule-based approach for verifying a three-layer model that was proposed for modeling expressive speech perception. The three layers are expressive speech, semantic primitives, and acoustic features. In our previous work we built the model. In the current work, the model is verified by creating rules with parameters that morph the acoustic characteristics of a neutral utterance so that it is perceived as conveying certain semantic primitives or expressive speech categories. There are two types of rules. Base rules verify the validity of the analytic results. Intensity rules verify the perceived intensity of expressive speech and semantic primitives. The experimental results show significant relationships among expressive speech, semantic primitives, and acoustic features. This model will help to develop tools such as a synthesizer that produces utterances giving listeners the perception of different categories and intensity levels of expressive speech.

On the Importance of Pure Prosody in the Perception of Speaker Identity
Elina Helander, Institute of Signal Processing, Tampere University of Technology
Jani Nurminen, Nokia Technology Platforms, Tampere, Finland

Many of the current techniques and systems that deal with speaker identity do not regard detailed prosody as a crucial source of speaker-dependent information. The reasoning behind this relates to the common assumption that the F0 level and the spectral data carry all or almost all of the speaker-dependent information. But is this assumption really valid? We have investigated the importance of prosodic information in the perception of speaker identity by conducting a test where the listeners tried to identify people they know after hearing only delexicalized pure prosody signals. The findings presented in this paper show that even a very rough prosodic representation consisting only of a single sinusoid can contain information on speaker identity, giving motivation for the development and wider usage of techniques that better exploit the prosodic aspects.
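
As a rough illustration of what a "single sinusoid" prosodic representation could look like, the sketch below synthesizes a sine wave whose frequency follows a frame-level F0 track. The abstract does not describe the authors' actual procedure, so the function name, the 10 ms frame duration, the 16 kHz sample rate, the constant amplitude, and the use of silence for unvoiced frames are all assumptions made for this example.

```python
import math

def synthesize_sinusoid(f0_frames, frame_dur=0.01, sample_rate=16000):
    """Return samples of a single sinusoid that follows a frame-level F0 track.

    f0_frames: F0 in Hz for each frame; 0 marks an unvoiced frame,
               which is rendered as silence (an assumption of this sketch).
    """
    samples_per_frame = int(round(frame_dur * sample_rate))
    phase = 0.0
    samples = []
    for f0 in f0_frames:
        for _ in range(samples_per_frame):
            if f0 > 0:
                # Accumulate phase so the waveform stays continuous
                # when F0 changes from one frame to the next.
                phase += 2.0 * math.pi * f0 / sample_rate
                samples.append(math.sin(phase))
            else:
                samples.append(0.0)
    return samples

# Hypothetical contour: a rise, a short unvoiced gap, then a fall.
contour = [100 + 2 * i for i in range(20)] + [0] * 5 + [140 - 2 * i for i in range(20)]
signal = synthesize_sinusoid(contour)
```

A signal of this kind carries no spectral or lexical content, only the timing and pitch movement of the original utterance, which is the sort of information the listening test isolates.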

Perceptual Relevance of Pitch Contours of Mandarin Tones and its Efficacy in Prosody Generation of Speech Synthesis
Shi-Han Chen, Industrial Technology Research Institute
Chih-Chung Kuo, Industrial Technology Research Institute

Modeling Mandarin tones is one of the most important issues in speech synthesis. However, established knowledge is mainly focused on the “production” aspect. In this paper, we first characterized relative pitch levels of tones. Next, two perceptual experiments were designed to investigate “perceptual” relevance of pitch levels and shapes in Mandarin. Results showed that relative pitch levels of tones were perceptually more important than exactness of pitch shapes, and humans could not perceptually distinguish tonal variations in synthesized Chinese names.

The Effect of Filled Pauses in a Lecture Speech on Impressive Evaluation of Listeners
Hiromitsu Nishizaki, University of Yamanashi
Mitsuhiro Sohmiya, University of Yamanashi
Kenji Kobayashi, University of Yamanashi
Yoshihiro Sekiguchi, University of Yamanashi

This paper examines how "filled pauses" in delivered speeches influence listeners' understanding and impression of the speech, based on research and experiments we conducted with trial subjects. We surveyed speeches and lectures given in classes at our university and at academic meetings. A questionnaire related to filled pauses was given to audiences in university classrooms, and the speeches given were recorded. Then, we prepared a number of speeches that were manually altered with respect to the frequency, position, and duration of filled pauses. By comparing those speeches with the unprocessed originals in our listening experiments, we were able to estimate the effect of filled pauses in a lecture speech and how effective they were in altering the impressions of the audiences. We were able to find the best conditions for the frequency, position, and duration of filled pauses, and to show how these conditions clearly changed a lecture or speech into a better one that is easy for the audience to understand and listen to.

Perceptual Equivalence of Approximated Cantonese Tone Contours
Yujia Li, Department of Electronic Engineering, The Chinese University of Hong Kong
Tan Lee, Department of Electronic Engineering, The Chinese University of Hong Kong

This paper describes a perceptual study on approximated Cantonese tone contours. We believe that the perception of tone contours relies mainly on the major trend of pitch movement, and is not sensitive to the exact F0 values at particular time instants. The tone contours of individual syllables and the transition between them are approximated as a small number of linear movements. The effect of such approximation is assessed by perceptual experiments. It is found that the six Cantonese tones can be represented by one or two linear movements, and the transition between tones can be represented by a single linear movement, without creating noticeable perceptual difference. Such simple approximations are desirable for perception-driven F0 modeling for text-to-speech applications.
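
The idea of replacing a tone contour with one or two linear movements can be sketched as follows. This is only an illustrative approximation scheme, not the authors' method: the helper names, the choice of a polyline through the first point, a turning point, and the last point, and the squared-error criterion for picking the turning point are assumptions, and the F0 values are hypothetical.

```python
def interpolate(y_start, y_end, n):
    """Return n points on a straight line from y_start to y_end, inclusive."""
    if n == 1:
        return [y_start]
    step = (y_end - y_start) / (n - 1)
    return [y_start + step * i for i in range(n)]

def squared_error(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def one_movement(f0):
    """Approximate the contour by a single line from its first to its last value."""
    return interpolate(f0[0], f0[-1], len(f0))

def two_movements(f0):
    """Approximate the contour by two connected lines.

    Tries every interior point k as the turning point and keeps the polyline
    (first point -> point k -> last point) with the smallest squared error.
    Requires at least three F0 values.
    """
    best_k, best_fit, best_err = None, None, float("inf")
    for k in range(1, len(f0) - 1):
        first = interpolate(f0[0], f0[k], k + 1)
        second = interpolate(f0[k], f0[-1], len(f0) - k)[1:]  # drop duplicate point k
        fit = first + second
        err = squared_error(f0, fit)
        if err < best_err:
            best_k, best_fit, best_err = k, fit, err
    return best_k, best_fit, best_err

# Hypothetical dipping contour in Hz (not data from the paper).
contour = [120, 115, 110, 105, 100, 105, 110]
single = one_movement(contour)
turning_point, double, double_err = two_movements(contour)
single_err = squared_error(contour, single)
```

For a contour with a clear turning point such as this one, the two-movement fit follows the dip while the single line cannot; whether the remaining difference is audible is exactly what the perceptual experiments assess.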

Audiovisual Emotional Speech of Game Playing Children: Effects of Age and Culture
Suleman Shahid, User System Interaction, Eindhoven University of Technology, The Netherlands
Emiel Krahmer, Communication and Cognition, Tilburg University, The Netherlands
Marc Swerts, Communication and Cognition, Tilburg University, The Netherlands

In this paper we study how children of different age groups (8 and 12 years old) and with different cultural backgrounds (Dutch and Pakistani) signal positive and negative emotions in audiovisual speech. Data was collected in an ethical way using a simple but surprisingly effective game in which pairs of participants have to guess whether an upcoming card will contain a higher or lower number than a reference card. The data thus collected was used in a series of cross-cultural perception studies, in which Dutch and Pakistani observers classified emotional expressions of Dutch and Pakistani speakers. Results show that classification accuracy is uniformly high for Pakistani children, but drops for older and for winning Dutch children.

