Interspeech 2007 Session ThD.P1b: Speakers: expression, emotion and personality recognition
Thursday, August 30, 2007
16:00 – 18:00
Elizabeth Shriberg (Speech Technology & Research Laboratory, SRI International)
The Relevance of Feature Type for the Automatic Classification of Emotional User States: Low Level Descriptors and Functionals
Björn Schuller, Institute for Human-Machine Communication, Technische Universität München, Germany
Anton Batliner, Lehrstuhl für Mustererkennung, Friedrich-Alexander-Universität, Erlangen, Germany
Dino Seppi, FBK-irst, Trento, Italy
Stefan Steidl, Lehrstuhl für Mustererkennung, Friedrich-Alexander-Universität, Erlangen, Germany
Thurid Vogt, Multimedia Concepts and their Applications, University of Augsburg, Germany
Johannes Wagner, Multimedia Concepts and their Applications, University of Augsburg, Germany
Laurence Devillers, Spoken Language Processing Group, LIMSI-CNRS, Orsay Cedex, France
Laurence Vidrascu, Spoken Language Processing Group, LIMSI-CNRS, Orsay Cedex, France
Noam Amir, Dep. of Communication Disorders, Sackler Faculty of Medicine, Tel Aviv University, Israel
Loic Kessous, Dep. of Communication Disorders, Sackler Faculty of Medicine, Tel Aviv University, Israel
Vered Aharonson, Afeka, Tel Aviv academic college of engineering, Tel Aviv, Israel
In this paper, we report on classification results for emotional user states (4 classes, German database of children interacting with a pet robot). Six sites computed acoustic and linguistic features independently from each other, following in part different strategies. A total of 4244 features were pooled together and grouped into 12 low level descriptor types and 6 functional types. For each of these groups, classification results using Support Vector Machines and Random Forests are reported for the full set of features, and for 150 features each with the highest individual Information Gain Ratio. The performance for the different groups varies mostly between approx. 50% and approx. 60%.
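The feature ranking by Information Gain Ratio mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that each feature has already been discretized into bins (the abstract does not say how continuous features were handled), and the names `entropy`, `gain_ratio` and `top_k` are hypothetical helpers, not the authors' code.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_bins, labels):
    """Information gain of a discretized feature divided by its split information."""
    n = len(labels)
    # Group the class labels by feature bin to get the conditional entropy H(Y|X).
    groups = {}
    for b, y in zip(feature_bins, labels):
        groups.setdefault(b, []).append(y)
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(feature_bins)
    if split_info == 0:
        # A constant feature carries no information; avoid dividing by zero.
        return 0.0
    return (entropy(labels) - h_cond) / split_info


def top_k(features, labels, k):
    """Return the names of the k features with the highest individual gain ratio."""
    ranked = sorted(features, key=lambda name: gain_ratio(features[name], labels),
                    reverse=True)
    return ranked[:k]
```

Calling `top_k(features, labels, 150)` on a dictionary of discretized feature columns would correspond to the per-group selection of 150 features described above, under the stated assumptions.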
Automatic question detection: prosodic-lexical features and crosslingual experiments
Vũ Minh Quang, MICA Center
Laurent Besacier, LIG Lab.
Eric Castelli, MICA Center
In this paper, we present our work on automatic question detection from the speech signal. We are interested in developing an automatic detection system and investigating the portability of such a system to a new language. The first goal of this paper is to propose and evaluate a combined approach for automatic question detection where prosodic features are augmented by the use of lexical features. It is shown that both early and late integration of these features in a decision tree-based classifier improves the question detection performance compared to a baseline system using prosodic features only. The second goal of this paper is to conduct a crosslingual (French / Vietnamese) evaluation concerning the use of prosodic features. It is shown that our first system developed for French, which uses an initial prosodic feature set, can be improved using a new feature set that takes into account some specific prosodic characteristics of the Vietnamese tonal language.
Performance Evaluation of HMM-Based Style Classification with a Small Amount of Training Data
Makoto Tachibana, Tokyo Institute of Technology
Keigo Kawashima, Tokyo Institute of Technology
Junichi Yamagishi, Tokyo Institute of Technology
Takao Kobayashi, Tokyo Institute of Technology
This paper describes a classification technique for emotional expressions and speaking styles of speech using only a small amount of training data of a target speaker. We model spectral and fundamental frequency (F0) features simultaneously using multi-space probability distribution HMM (MSD-HMM), and adapt a speaker-independent neutral style model to a certain target speaker's style model with a small amount of data using MSD-MLLR, which is MLLR extended for MSD-HMM. We perform classification experiments on professional narrators' speech and non-professional speakers' speech and evaluate the performance of the proposed technique by comparing it with other commonly used classifiers. We show that the proposed technique gives better results than the other classifiers when using a few sentences of the target speaker's style data.
Visualizing acoustic similarities between emotions in speech: an acoustic map of emotions
Khiet Truong, TNO Human Factors
David Van Leeuwen, TNO Human Factors
In this paper, we introduce a visual analysis method to assess the discriminability and confusability between emotions according to automatic emotion classifiers. The degree of acoustic similarities between emotions can be defined in terms of distances that are based on pair-wise emotion discrimination experiments. By employing Multidimensional Scaling, the discriminability between emotions can then be visualized in a two-dimensional plot that is relatively easy to interpret. This ‘map of emotions’ is compared to the well-known ‘Feeltrace’ two-dimensional mapping of emotions. While there is correlation with the ‘arousal’ dimension of Feeltrace, it appears that the ‘valence’ dimension is difficult to relate to the acoustic map.
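As an illustration of the Multidimensional Scaling step, the following is a minimal sketch of classical (Torgerson) MDS, which turns a symmetric matrix of pairwise distances into two-dimensional coordinates. The abstract does not state which MDS variant was used or how the distances were derived from the discrimination experiments, so the variant, the function name `classical_mds` and the power-iteration solver are assumptions made for this example.

```python
import math


def classical_mds(dist, dims=2, iters=500):
    """Embed n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # Double centering: B = -0.5 * (D^2 - row means - column means + grand mean).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration from a fixed start vector; it finds the eigenvalue of
        # largest magnitude, which is the largest positive one when the
        # distances are (close to) Euclidean.
        v = [math.sin(i + 1 + d) for i in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflate so that the next pass finds the next eigenpair.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords
```

Each row of the result gives the plot position of one emotion. The axes of such a map are determined only up to rotation and reflection, which is one reason a comparison with fixed dimensions such as arousal and valence needs care.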
Fusion of Global Statistical and Segmental Spectral Features for Speech Emotion Recognition
Hao Hu, Center for Speech Technology, Tsinghua National Lab for Information Science and Technology, Tsinghua University
Ming-Xing Xu, Center for Speech Technology, Tsinghua National Lab for Information Science and Technology, Tsinghua University
Wei Wu, Center for Speech Technology, Tsinghua National Lab for Information Science and Technology, Tsinghua University
Speech emotion recognition is an interesting and challenging speech technology, which can be applied to broad areas. In this paper, we propose to fuse the global statistical and segmental spectral features at the decision level for speech emotion recognition. Each emotional utterance is individually scored by two recognition systems, the global statistics-based and segmental spectrum-based systems, and a weighted linear combination is applied to fuse their scores for final decision. Experimental results on an emotional speech database demonstrate that the global statistical and segmental spectral features are complementary, and the proposed fusion approach further improves the performance of the emotion recognition system.
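The decision-level fusion described above, a weighted linear combination of the scores of two recognition systems, can be sketched as follows. The sketch assumes that both systems output one score per emotion on a comparable scale (for example normalized posteriors); the function name `fuse_scores` and the parameter `weight` are hypothetical, and the abstract does not say how the weight was chosen.

```python
def fuse_scores(global_scores, segmental_scores, weight=0.5):
    """Fuse two per-emotion score dictionaries and return (best label, fused scores).

    `weight` is applied to the global-statistics system and (1 - weight)
    to the segmental-spectrum system.
    """
    if set(global_scores) != set(segmental_scores):
        raise ValueError("both systems must score the same set of emotions")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    fused = {
        label: weight * global_scores[label] + (1.0 - weight) * segmental_scores[label]
        for label in global_scores
    }
    # The final decision is the emotion with the highest fused score.
    best = max(fused, key=fused.get)
    return best, fused
```

In practice the weight would typically be tuned on held-out data; setting it to 0 or 1 recovers either single system.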
Group Delay Features for Emotion Detection
Vidhyasaharan Sethu, School of Electrical Engineering and Telecommunications, The University of New South Wales, Australia
Eliathamby Ambikairajah, School of Electrical Engineering and Telecommunications, The University of New South Wales, Australia
Julien Epps, School of Electrical Engineering and Telecommunications, The University of New South Wales, Australia
This paper focuses on speech based emotion classification utilizing acoustic data. The most commonly used acoustic features are pitch and energy along with prosodic information like rate of speech. We propose the use of a novel feature based on the phase response of an all-pole model of the vocal tract obtained from linear predictive coefficients (LPC) in addition to the aforementioned features. We compare this feature to other commonly used acoustic features based on classification accuracy. The back-end of our system employs a probabilistic neural network based classifier. Evaluations conducted on the LDC Emotional Prosody speech corpus indicate the proposed features are well suited to the task of emotion classification. The proposed features are able to provide a relative increase in classification accuracy of about 14% over established features when combined with them to form a larger feature vector.
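One way to obtain a phase-based quantity from an all-pole model is its group delay, the negative derivative of the phase response with respect to frequency. The sketch below computes it in closed form from a given set of LPC coefficients. It is an illustrative reading of the abstract, not the authors' feature extraction: the function name `allpole_group_delay`, the frequency grid and the coefficient convention are assumptions.

```python
import cmath
import math


def allpole_group_delay(lpc, n_freqs=8):
    """Group delay (in samples) of H(z) = 1 / A(z) on a grid from 0 to pi.

    `lpc` holds the denominator coefficients [1, a1, ..., ap] of
    A(z) = 1 + a1 z^-1 + ... + ap z^-p. A stable model is assumed, so that
    A has no zeros on the unit circle. `n_freqs` must be at least 2.
    """
    delays = []
    for i in range(n_freqs):
        w = math.pi * i / (n_freqs - 1)
        a = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(lpc))
        ka = sum(k * c * cmath.exp(-1j * w * k) for k, c in enumerate(lpc))
        # The group delay of A is Re(ka / a); inverting the filter flips the sign.
        delays.append(-(ka / a).real)
    return delays
```

For the single-pole model `[1.0, -0.5]` the delay is 1 sample at frequency 0 and -1/3 sample at frequency pi, which matches the analytic expression (r cos w - r^2) / (1 - 2 r cos w + r^2) with r = 0.5.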
Combining Short-term Cepstral and Longer-term Prosodic Features for Automatic Recognition of Speaker Age
Christian Müller, International Computer Science Institute
Felix Burkhardt, T-Systems
The most successful systems in previous comparative studies on speaker age recognition used short-term cepstral features modeled with Gaussian Mixture Models (GMMs) or applied multiple phone recognizers trained with the data of speakers of the respective class. Acoustic analyses, however, indicate that certain features such as pitch extracted from a longer span of speech correlate clearly with speaker age, although the systems based on those features have been inferior to the aforementioned approaches. In this paper, three novel systems combining short-term cepstral features and long-term features for speaker age recognition are compared to each other. A system combining GMMs using frame-based MFCCs and Support Vector Machines using long-term pitch performs best. The results indicate that the combination of the two feature types is a promising approach, which corresponds to findings in related fields like speaker recognition.
Detecting Deception Using Critical Segments
Frank Enos, Columbia University
Elizabeth Shriberg, SRI/ICSI
Martin Graciarena, SRI
Julia Hirschberg, Columbia University
Andreas Stolcke, SRI/ICSI
In this paper we present an investigation of segments that map to GLOBAL LIES, that is, the intent to deceive with respect to salient topics of the discourse. We propose that identifying the truth or falsity of these CRITICAL SEGMENTS may be important in determining a speaker's veracity over the larger topic of discourse. Further, answers to key questions, which can be identified a priori, may represent emotional and cognitive HOT-SPOTS, analogous to those observed by psychologists who study gestural and facial cues to deception. We present results of experiments that use two different definitions of CRITICAL SEGMENTS and employ machine learning techniques that compensate for imbalances in the dataset. Using this approach, we achieve a performance gain of 23.8% relative to chance, in contrast with human performance on a similar task, which averages substantially below chance. We discuss the features used by the models, and consider how these findings can influence future research.
Style Estimation of Speech Based on Multiple Regression Hidden Semi-Markov Model
Takashi Nose, Tokyo Institute of Technology
Yoichi Kato, Tokyo Institute of Technology
Takao Kobayashi, Tokyo Institute of Technology
This paper presents a technique for estimating the degree or intensity of emotional expressions and speaking styles appearing in speech. The key idea is based on a style control technique for speech synthesis using a multiple regression hidden semi-Markov model (MRHSMM), and the proposed technique can be viewed as the inverse process of the style control. We derive an algorithm, based on an ML criterion, for estimating the predictor variables of the MRHSMM, each of which represents a sort of emotion intensity or speaking style variability appearing in acoustic features. We also show preliminary experimental results to demonstrate the ability of the proposed technique on synthetic and acted speech samples with emotional expressions and speaking styles.
Analysis and Classification of Speech Mode: Whispered through Shouted
Chi Zhang, Center of Robust Speech Systems, Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas 75083, USA
John Hansen, Center of Robust Speech Systems, Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas 75083, USA
Variation in vocal effort represents one of the most challenging problems in maintaining speech system performance for coding, speech and speaker recognition. Changes in vocal effort result in a fundamental change in speech production. This is the first study to collectively consider the five speech modes: whispered, soft, neutral, loud and shouted. After corpus development, analysis is performed for SIL, duration and silence percentage, frame energy distribution and spectral tilt. The analysis shows vocal effort dependent traits which are used to investigate speaker recognition. Matched vocal mode conditions result in a closed-set speaker ID rate of 97.62% (54.02% for mismatched vocal conditions). A speech mode classification system is developed, which produces a 70% classification rate (96.7% for whispered). These advancements can provide improved speech/speaker modeling information, as well as classified vocal mode knowledge to improve speech and language technology in real scenarios.