Elizabeth Shriberg (Speech Technology & Research Laboratory, SRI International)
An analysis of individual differences in the F0 contour and the duration of anger utterances at several degrees
Hiromi Kawatsu, Graduate School of Bionics, Computer and Media Sciences, Tokyo University of Technology
Sumio Ohno, Graduate School of Bionics, Computer and Media Sciences, Tokyo University of Technology
Focusing on anger expressed in speech, prosodic features were analyzed to clarify the relationship between the degree of anger and its manifestations in the speech signal, with attention to individual differences. The analysis revealed features common to all speakers as well as speaker-dependent ones. Common tendencies were found for the baseline frequency and the magnitude of the first phrase command. The amplitude of the accent command increases as the emotional degree increases; some speakers emphasized accent commands at all positions within a sentence, while others emphasized them only near the end. For emotional utterances, the speaking rate of the 1st and 4th phrases was faster than that of the 2nd and 3rd phrases, although the effect of the emotional degree differed among speakers. Interestingly, these two prosodic aspects may complement each other in representing differences in emotional degree.
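The phrase and accent commands analyzed above are the building blocks of the Fujisaki model of the F0 contour, in which log F0 is a baseline plus phrase components (impulse responses) and accent components (step responses). A minimal sketch of that standard generation model follows; the time constants and command values are illustrative, not this paper's fitted parameters:

```python
import math

def phrase(t, alpha=3.0):
    # phrase component: impulse response of a second-order system
    return alpha ** 2 * t * math.exp(-alpha * t) if t > 0 else 0.0

def accent(t, beta=20.0, ceiling=0.9):
    # accent component: step response, clipped at a ceiling
    if t <= 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), ceiling)

def log_f0(t, fb, phrases, accents):
    # ln F0(t) = ln Fb + sum of phrase commands + sum of accent commands
    # phrases: list of (magnitude Ap, onset time T0)
    # accents: list of (amplitude Aa, onset T1, offset T2)
    y = math.log(fb)
    y += sum(ap * phrase(t - t0) for ap, t0 in phrases)
    y += sum(aa * (accent(t - t1) - accent(t - t2)) for aa, t1, t2 in accents)
    return y

# illustrative contour: baseline 120 Hz, one phrase command, one accent command
f0_at_300ms = math.exp(log_f0(0.3, 120.0, [(0.5, 0.0)], [(0.4, 0.2, 0.45)]))
```

An increased accent-command amplitude, as reported for higher anger degrees, directly raises the local F0 excursion between the accent onset and offset.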
Acoustic Features of Anger Utterances during Natural Dialog
Yoshiko Arimoto, Graduate School of Bionics, Computer and Media Sciences, Tokyo University of Technology
Sumio Ohno, School of Computer Science, Tokyo University of Technology
Hitoshi Iida, School of Media Science, Tokyo University of Technology
This report focuses on automatic estimation of the degree of a speaker's anger. Two kinds of pseudo-dialogs were held to collect spontaneous anger utterances during natural Japanese dialog. To quantify the degree of anger, a six-point subjective evaluation was conducted in which twelve evaluators graded every utterance by anger degree. With this data set, acoustic features of each utterance were examined to identify cues for estimating the degree of anger. To examine the feasibility of automatic emotion estimation, we conducted experiments estimating the degree of anger by multiple regression analysis on the acoustic parameters.
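The final step, predicting a six-point anger degree by multiple regression on acoustic parameters, can be sketched with ordinary least squares. The two features (mean F0, mean power) and all numbers below are made-up stand-ins for the paper's acoustic parameters:

```python
def ols_fit(X, y):
    """Fit weights minimizing squared error via the normal equations,
    with an intercept column prepended to the design matrix."""
    rows = [[1.0] + list(x) for x in X]
    n = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(n)]
    for col in range(n):                      # Gaussian elimination, partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n
    for i in reversed(range(n)):              # back-substitution
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

def predict(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

# toy data: (mean F0 in Hz, mean power in dB) -> subjective anger degree (1-6)
feats = [(180.0, 60.0), (220.0, 65.0), (260.0, 70.0), (240.0, 72.0), (200.0, 58.0)]
degrees = [1.0, 3.5, 6.0, 5.2, 1.8]
w = ols_fit(feats, degrees)
estimated = predict(w, (230.0, 66.0))   # anger degree for an unseen utterance
```

In practice the regression would be fitted on the evaluator-graded utterances and scored by how well predicted degrees correlate with the subjective ratings.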
Comparing American and Palestinian Perceptions of Charisma Using Acoustic-Prosodic and Lexical Analysis
Fadi Biadsy, Columbia University
Julia Hirschberg, Columbia University
Andrew Rosenberg, Columbia University
Wisam Dakka, Columbia University
Charisma, the ability to lead by virtue of personality alone, is difficult to define but relatively easy to identify. However, cultural factors clearly affect perceptions of charisma. In this paper we compare results from parallel perception studies investigating charismatic speech in Palestinian Arabic and American English. We examine acoustic/prosodic and lexical correlates of charisma ratings to determine how the two cultures differ with respect to their views of charismatic speech.
Using Neutral Speech Models for Emotional Speech Analysis
Carlos Busso, University of Southern California
Sungbok Lee, University of Southern California
Shrikanth Narayanan, University of Southern California
Since emotional speech can be regarded as a variation on neutral (non-emotional) speech, a robust neutral speech model should be useful for contrasting the different emotions expressed in speech. This study explores this idea by creating acoustic models trained on spectral features from the emotionally neutral TIMIT corpus. Performance is tested on two emotional speech databases: one recorded with a microphone (acted) and another recorded from a telephone application (spontaneous). Accuracies of up to 78% and 65% are achieved in binary and category emotion discrimination, respectively. Raw Mel Filter Bank (MFB) output was found to perform better than conventional MFCCs for both broad-band and telephone-band speech. These results suggest that well-trained neutral acoustic models can be effectively used as a front-end for emotion recognition and that, once trained on MFB features, they may work reasonably well regardless of channel characteristics.
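One way to picture a neutral model as a front-end is as a likelihood test: frames are scored against the model trained on neutral speech, and a low likelihood signals deviation from neutral, i.e. a candidate emotional segment. A toy sketch with a single 1-D Gaussian standing in for the paper's spectral models; the threshold and data are invented:

```python
import math

def neutral_loglik(frames, mean, var):
    # average per-frame log-likelihood under a 1-D Gaussian "neutral model"
    # (the paper's models are trained on full spectral features from TIMIT)
    return sum(-0.5 * (math.log(2 * math.pi * var) + (f - mean) ** 2 / var)
               for f in frames) / len(frames)

def is_emotional(frames, mean, var, threshold):
    # binary neutral-vs-emotional decision by likelihood thresholding
    return neutral_loglik(frames, mean, var) < threshold

# frames close to the neutral model vs. frames that deviate strongly
neutral_like = [0.1, -0.2, 0.05]
deviant = [3.0, 3.5, 2.8]
```

Category-level discrimination would extend this idea by comparing which regions of the feature space deviate, rather than a single threshold.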
Emotion Clustering Using the Results of Subjective Opinion Tests for Emotion Recognition in Infants' Cries
Noriko Satoh, Nagasaki University
Katsuya Yamauchi, Nagasaki University
Shoichi Matsunaga, Nagasaki University
Masaru Yamashita, Nagasaki University
Ryuta Nakagawa, Nagasaki University
Kazuyuki Shinohara, Nagasaki University
This paper proposes an emotion clustering procedure for emotion detection in infants' cries. The procedure uses the results of subjective opinion tests regarding the emotions expressed in infants' cries and yields a tree-structured set of emotion clusters generated by progressively merging emotions. Each merge is chosen to minimize an objective function measuring the ambiguity of the emotions detected in the opinion tests. The experimental results show that the proposed clustering, which considers the evaluation rank of each emotion, is superior to clustering concerned only with the detection or non-detection of each emotion. Based on the clustering results, we performed a recognition experiment on two emotion clusters; the proposed emotion clusters achieve a detection rate of 75%, which shows the effectiveness of the proposed procedure.
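The merging step can be sketched as greedy bottom-up clustering. The paper's exact ambiguity objective is not given here, so as a stand-in the sketch merges the pair of clusters whose emotions the evaluators confused most often; the emotion labels and confusion counts are hypothetical:

```python
def cluster_emotions(labels, confusion, target):
    """Greedy bottom-up merging: repeatedly join the two clusters whose
    member emotions were most often confused in the opinion tests."""
    clusters = [frozenset([l]) for l in labels]
    history = []
    while len(clusters) > target:
        best_pair, best_score = None, -1.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = sum(confusion.get((a, b), 0) + confusion.get((b, a), 0)
                        for a in clusters[i] for b in clusters[j])
                if s > best_score:
                    best_score, best_pair = s, (i, j)
        i, j = best_pair
        merged = clusters[i] | clusters[j]
        history.append(merged)            # records the merge tree bottom-up
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return clusters, history

# hypothetical opinion-test confusions: counts of "emotion a judged as b"
LABELS = ["hunger", "sleepiness", "pain", "discomfort"]
CONFUSION = {("hunger", "sleepiness"): 8, ("sleepiness", "hunger"): 7,
             ("pain", "discomfort"): 9, ("discomfort", "pain"): 8,
             ("hunger", "pain"): 1, ("sleepiness", "discomfort"): 1}
clusters, history = cluster_emotions(LABELS, CONFUSION, target=2)
```

Stopping at two clusters mirrors the paper's two-cluster recognition experiment; the merge history gives the tree data structure.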
On the limitations of voice conversion techniques in emotion identification tasks
Roberto Barra, Universidad Politecnica de Madrid
Juan M. Montero, Universidad Politecnica de Madrid
Javier Macias-Guarasa, Universidad Politecnica de Madrid
Juana Gutierrez-Arriola, Universidad Politecnica de Madrid
Javier Ferreiros, Universidad Politecnica de Madrid
Jose M. Pardo, Universidad Politecnica de Madrid
The growing interest in emotional speech synthesis calls for the exploration of effective emotion conversion techniques. This paper estimates the relevance of three speech components (spectral envelope, residual excitation and prosody) for synthesizing identifiable emotional speech, with the aim of customizing voice conversion techniques to the specific characteristics of each emotion. The analysis is based on a listening test with a set of synthetic mixed-emotion utterances that draw their speech components from emotional and neutral recordings. Results show the importance of transforming the residual excitation for the identification of emotions that are not fully conveyed through prosodic means (such as cold anger or sadness in our Spanish corpus).
Use of Lexical and Affective Prosodic Cues to Emotion by Younger and Older Adults
Kate Dupuis, University of Toronto
Kathleen Pichora-Fuller, University of Toronto
Older adults often report that, although they are able to hear conversations, they have difficulty attending to or understanding what is being said. Interactions between cognitive and perceptual processing are necessary for comprehension. It is possible that older adults have difficulty determining what type of emotional information is represented by the affective tone in which speech is spoken. Two studies were conducted using sentences to examine the use of lexical and affective prosodic cues to emotion by younger and older adults. The present studies are the first steps in a research programme concerning the effect of age on perceptual and cognitive interactions during comprehension of affective prosody.
Two-Stream Emotion Recognition For Call Center Monitoring
Nitendra Rajput, IBM Research
Purnima Gupta, Indian Institute of Technology
We present a technique for two-stream processing of speech signals for emotion detection. The first stream recognises emotion from acoustic features, while the second recognises emotion from the semantics of the conversation. A probabilistic measure is derived for each stream, and the outputs of the two streams are combined to generate a score for each emotion category. The confidence level of each stream is used to weight its score when generating the final score. This technique is especially relevant for call-center data, where the speech carries useful semantics. The proposed technique is evaluated on the LDC corpus and on real-world call-center data. Experiments suggest that the two-stream process yields better results than existing techniques that extract emotion from acoustic features alone.
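The confidence-weighted combination of the two streams can be sketched as follows; the category names, scores, and confidence values are illustrative, not from the paper:

```python
def fuse_streams(acoustic, lexical, conf_acoustic, conf_lexical):
    """Combine per-category scores from the acoustic and semantic streams,
    weighting each stream by its confidence, and return the winning
    emotion category together with the fused scores."""
    total = conf_acoustic + conf_lexical
    wa, wl = conf_acoustic / total, conf_lexical / total
    fused = {cat: wa * acoustic[cat] + wl * lexical[cat] for cat in acoustic}
    return max(fused, key=fused.get), fused

# illustrative per-stream scores for one call-center turn
acoustic_scores = {"anger": 0.6, "neutral": 0.4}
lexical_scores = {"anger": 0.2, "neutral": 0.8}
category, fused = fuse_streams(acoustic_scores, lexical_scores, 0.3, 0.7)
```

With these numbers a confident semantic stream overrules the acoustic one; raising the acoustic confidence flips the decision, which is exactly the role the per-stream confidence plays in the final score.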
The role of intonation and voice quality in the affective speech perception
Ioulia Grichkovtsova, CRISCO, University of Caen, France
Anne Lacheret, MoDyCO, University of Paris X, Nanterre, France
Michel Morel, CRISCO, University of Caen, France
The perceptual value of intonation and voice quality is investigated for six affective states: anger, sadness, happiness, obviousness, doubt and irony. The main research question is whether intonation and voice quality are equally important in the perception of the studied affective states or whether one of them is privileged. The six affective states were tested on utterances with natural lexical meaning, and the transplantation paradigm was used to design the audio stimuli. Perception results show that each studied affective state makes its own use of prosody and voice quality. Some differences were found in the identification of emotions and attitudes. New questions raised by the present study and further directions of work are presented.
Combining Frame and Turn-Level Information for Robust Recognition of Emotions within Speech
Bogdan Vlasenko, Cognitive Systems, IESK, Otto-von-Guericke University, Magdeburg, Germany
Björn Schuller, Institute for Human-Machine Communication, Technische Universität München, Germany
Andreas Wendemuth, Cognitive Systems, IESK, Otto-von-Guericke University, Magdeburg, Germany
Gerhard Rigoll, Institute for Human-Machine Communication, Technische Universität München, Germany
Current approaches to the recognition of emotion within speech usually use statistical feature information obtained by applying functionals at the turn or chunk level. Yet it is well known that important information on temporal sub-layers such as the frame level is thereby lost. We therefore investigate the benefits of integrating such information within the turn-level feature space. For frame-level analysis we use GMMs for classification, with 39 MFCC and energy features with CMS. In a subsequent step, the output scores are fed forward into a turn-level SVM emotion recognition engine with a 1.4k-dimensional feature space. Thereby we use a variety of Low-Level Descriptors and functionals to cover prosodic, speech quality, and articulatory aspects. Extensive test runs are carried out on the public databases EMO-DB and SUSAS. Speaker-independent analysis is addressed by speaker normalization. Overall results strongly emphasize the benefits of feature integration on diverse time scales.
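The feed-forward of frame-level scores into the turn-level feature space can be pictured as stacking: per-frame log-likelihoods under each emotion model are averaged over the turn and appended to the turn-level functionals before classification. The sketch below uses a single 1-D Gaussian per emotion in place of the GMMs over 39 MFCC and energy features; all numbers are illustrative:

```python
import math

def loglik(frame, mean, var):
    # log-likelihood of one scalar frame feature under a 1-D Gaussian
    # (stand-in for a per-emotion GMM over the full feature vector)
    return -0.5 * (math.log(2 * math.pi * var) + (frame - mean) ** 2 / var)

def frame_level_scores(frames, emotion_models):
    # mean per-frame score under each emotion's model: the frame-level output
    return [sum(loglik(f, m, v) for f in frames) / len(frames)
            for m, v in emotion_models]

def turn_feature_vector(frames, emotion_models):
    # turn-level functionals (here: mean, variance, min, max) plus the
    # forwarded frame-level scores; this combined vector would feed the SVM
    mean = sum(frames) / len(frames)
    var = sum((f - mean) ** 2 for f in frames) / len(frames)
    return ([mean, var, min(frames), max(frames)]
            + frame_level_scores(frames, emotion_models))

# two toy emotion models and one turn of scalar frame features
models = [(0.0, 1.0), (5.0, 1.0)]
frames = [0.1, -0.3, 0.2]
vec = turn_feature_vector(frames, models)
```

The real system's functional set is far larger (about 1.4k features from many Low-Level Descriptors), but the stacking structure is the same.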