Interspeech 2007 Session WeD.O3: Prosodic modeling II
Wednesday, August 29, 2007
16:00 – 18:00
Esther Klabbers (CSLU, Oregon Health & Science University)
Corpus-based Generation of Prosodic Features from Text Based on Generation Process Model
Keikichi Hirose, Dept. of Information and Communication Engineering, School of Information Science and Technology, University of Tokyo
Keiko Ochi, Dept. of Frontier Informatics, School of Frontier Sciences, University of Tokyo
Nobuaki Minematsu, Dept. of Frontier Informatics, School of Frontier Sciences, University of Tokyo
A complete scheme for generating prosodic features from text input was constructed. The method performs corpus-based prediction of pauses, phone durations, and fundamental frequencies (F0), in that order, with the information predicted at each stage utilized in the following ones. Since F0 is predicted via the command values of the F0 contour generation process model rather than as raw F0 values, F0 contours can be controlled stably and flexibly. By adding constraints on accent command timing as a post-processing step, better quality was achieved when speech was synthesized using the prosodic features generated by the method. The validity of the developed method was confirmed through listening tests on the synthetic speech.
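The F0 contour generation process model referred to here is the well-known command-response (Fujisaki) model, in which log F0 is a baseline plus the responses of second-order systems to phrase commands (impulses) and accent commands (pedestals). The following sketch illustrates that idea; all command timings, amplitudes, and the constants alpha, beta, gamma are illustrative values, not the paper's.

```python
import numpy as np

def phrase_component(t, t0, ap, alpha=3.0):
    """Response to a phrase command (impulse) at time t0 with magnitude ap."""
    tau = np.maximum(t - t0, 0.0)
    return ap * alpha**2 * tau * np.exp(-alpha * tau)

def accent_component(t, t1, t2, aa, beta=20.0, gamma=0.9):
    """Response to an accent command (step on during [t1, t2]) with amplitude aa."""
    def g(tau):
        tau = np.maximum(tau, 0.0)
        return np.minimum(1.0 - (1.0 + beta * tau) * np.exp(-beta * tau), gamma)
    return aa * (g(t - t1) - g(t - t2))

def f0_contour(t, fb, phrase_cmds, accent_cmds):
    """ln F0(t) = ln Fb + sum of phrase responses + sum of accent responses."""
    ln_f0 = np.log(fb) * np.ones_like(t)
    for t0, ap in phrase_cmds:
        ln_f0 += phrase_component(t, t0, ap)
    for t1, t2, aa in accent_cmds:
        ln_f0 += accent_component(t, t1, t2, aa)
    return np.exp(ln_f0)

t = np.linspace(0.0, 2.0, 200)
f0 = f0_contour(t, fb=120.0, phrase_cmds=[(0.0, 0.5)],
                accent_cmds=[(0.2, 0.6, 0.4), (1.0, 1.5, 0.3)])
```

Predicting the sparse command parameters (timings and amplitudes) instead of frame-by-frame F0 values is what gives the approach its stability: the filters guarantee a smooth, physiologically plausible contour, and the post-processing constraint in the abstract amounts to restricting where accent command onsets and offsets may fall.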
Novel Eigenpitch-based Prosody Model for Text-to-Speech Synthesis
Jilei Tian, Nokia Research Center
Jani Nurminen, Nokia Technology Platform
Imre Kiss, Nokia Research Center
Prosody is an inherent supra-segmental feature in speech that human speakers employ to express, for example, attitude, emotion, intent and attention. In text-to-speech (TTS) systems, high naturalness can only be achieved if the prosody of the output is appropriate. The importance of prosody is even more crucial for tonal languages, such as Mandarin Chinese, in which the tone of each syllable is described by its pitch contour. In this paper, we propose a novel prosody modeling approach that uses the concept of syllable-based eigenpitch. The approach has been implemented in our Mandarin TTS system resulting in less than 0.1% error variance. The results obtained in practical experiments have confirmed the good performance of the proposed technique.
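The eigenpitch concept can be illustrated with principal component analysis: each syllable's length-normalized pitch contour is treated as a vector, and a small number of eigenvectors ("eigenpitch" components) plus per-syllable weights represent it compactly. The sketch below uses synthetic contours, not the paper's data, and a plain SVD-based PCA as an assumed implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_syllables, n_points = 500, 20

# Synthetic "pitch contours": mixtures of a few smooth basis shapes plus noise.
x = np.linspace(0, 1, n_points)
basis = np.stack([np.ones_like(x), x, np.sin(np.pi * x)])
weights = rng.normal(size=(n_syllables, 3))
contours = weights @ basis + 0.01 * rng.normal(size=(n_syllables, n_points))

# PCA: centre the data, then SVD; rows of vt are the eigenpitch vectors.
mean = contours.mean(axis=0)
u, s, vt = np.linalg.svd(contours - mean, full_matrices=False)

k = 3  # keep the top-k eigenpitch components
coeffs = (contours - mean) @ vt[:k].T   # per-syllable weights
reconstructed = mean + coeffs @ vt[:k]  # low-dimensional reconstruction

residual = np.var(contours - reconstructed) / np.var(contours)
```

A TTS front end then only has to predict the k weights per syllable rather than a full contour; the residual variance measures how much contour detail the truncated basis discards.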
Modelling Prominence and Emphasis Improves Unit-Selection Synthesis
Volker Strom, CSTR, University of Edinburgh
Ani Nenkova, Linguistics Department, Stanford University
Robert Clark, CSTR, University of Edinburgh
Yolanda Vazquez-Alvarez, CSTR, University of Edinburgh
Jason Brenier, Linguistics Department, Stanford University
Simon King, CSTR, University of Edinburgh
Dan Jurafsky, Linguistics Department, Stanford University
We describe the results of large-scale perception experiments showing improvements in synthesising two distinct kinds of prominence: standard pitch accents and strong emphatic accents. Previously, prominence assignment has mainly been evaluated by computing accuracy on a prominence-labelled test set. By contrast, we integrated an automatic pitch-accent classifier into the unit-selection target cost and showed that listeners preferred the resulting synthesised sentences. We also describe an improved recording script for collecting emphatic accents, and show that generating emphatic accents yields further improvements in the fiction genre beyond incorporating pitch accents alone. Finally, we show differences in the effects of prominence between child-directed speech and the news and fiction genres.
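Folding a predicted prominence label into a unit-selection target cost can be sketched as one more weighted mismatch term. The feature names and weights below are hypothetical illustrations, not CSTR's actual cost function.

```python
def target_cost(target, candidate, w_phone=1.0, w_prominence=0.5):
    """Weighted sum of mismatches between the target spec and a candidate unit."""
    cost = 0.0
    if target["phone"] != candidate["phone"]:
        cost += w_phone
    # Prominence mismatch: penalise a plain unit where a pitch accent (or an
    # emphatic accent) is requested, and vice versa.
    if target["prominence"] != candidate["prominence"]:
        cost += w_prominence
    return cost

target = {"phone": "aa", "prominence": "accented"}
cands = [{"phone": "aa", "prominence": "none"},
         {"phone": "aa", "prominence": "accented"}]
best = min(cands, key=lambda c: target_cost(target, c))
```

The point of the paper's evaluation strategy is that the classifier's labels are judged by their effect inside this search, via listener preference, rather than by label accuracy in isolation.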
A Framework of Reply Speech Generation for Concept-to-Speech Conversion in Spoken Dialogue Systems
Seiya Takada, Graduate School of Information Science and Technology, University of Tokyo, Japan
Yuji Yagi, Graduate School of Engineering, University of Tokyo, Japan
Keikichi Hirose, Graduate School of Information Science and Technology, University of Tokyo, Japan
Nobuaki Minematsu, Graduate School of Frontier Sciences, University of Tokyo, Japan
Thanks to recent advances in speech technologies, many spoken dialogue systems have been constructed. However, since most of them adopt existing text-to-speech synthesizers, it is difficult for them to properly reflect in the output speech the linguistic information obtained during reply-sentence generation. A framework is needed that correctly conveys higher-level linguistic information, such as syntactic structure and discourse information. We have constructed a spoken dialogue system for road guidance and realized concept-to-speech conversion, in which output speech is generated in a unified process. Tagged LISP forms preserve the syntactic structures throughout the process so that this linguistic information can be reflected in the prosody of the output speech. Furthermore, because not only words but also phrase templates can be inserted into the tags, a wide variety of sentences can be generated with only a minor increase in the number of templates. The validity of the methods is shown through experiments.
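The tagged-form idea can be pictured as nested (tag, children...) structures whose slots hold either literal words or names of further phrase templates, so the syntactic bracketing survives until prosody is assigned. The tag names, templates, and expansion function below are invented for illustration and are not the paper's actual representation.

```python
def expand(form, templates):
    """Recursively expand a nested (tag, children...) form into a word list,
    with the tag nesting available for later prosodic phrasing decisions."""
    if isinstance(form, str):  # a literal word
        return [form]
    tag, *children = form
    words = []
    for child in children:
        if isinstance(child, str) and child in templates:
            words += expand(templates[child], templates)  # phrase template slot
        else:
            words += expand(child, templates)
    return words

templates = {"DEST": ("np", "the", "station")}
reply = ("s", ("vp", "turn", "left"), ("pp", "at", "DEST"))
words = expand(reply, templates)
```

Because "DEST" can be rebound to any noun-phrase template, many distinct replies share one sentence frame, which is the economy the abstract points to.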
Synthesis of prosodic attitudinal variants in German backchannel ja
Thorsten Stocksmeier, Technische Fakultät, Universität Bielefeld
Dafydd Gibbon, Department of Linguistics, Universität Bielefeld
Stefan Kopp, Technische Fakultät, Universität Bielefeld
Feedback utterances are an important part of any dialog between humans. When two or more persons talk, they use short backchannel utterances to signal understanding and interest in the conversation. Surprisingly little is known about the relationship between the accompanying prosody and the meaning of feedback perceived by the dialog partner. We present a qualitative modelling study of 12 synthesized German ja (yes) interjections that shows the influence of prosodic features on emotional and pragmatic perception of this kind of feedback. Listeners perceived utterances as bored, hesitant, or happy and agreeing depending on the prosodic parameters used for synthesis.
Inter-language prosodic style modification experiment using word impression vector for communicative speech generation
Ke Li, GITI/Language and Speech Science Res. Lab Waseda University
Yoko Greenberg, GITI/Language and Speech Science Res. Lab Waseda University
Yoshinori Sagisaka, GITI/Language and Speech Science Res. Lab Waseda University
To confirm the language independence of communicative prosody generation from an input word impression vector, we synthesized communicative Mandarin speech using the prosodic characteristics of communicative Japanese speech. The fundamental frequency and duration characteristics of one-word Japanese “n” utterances were transferred to Mandarin via the input word attributes. From the subjective impressions of an input word, a three-dimensional vector was computed through Multi-Dimensional Scaling analysis. The three dimensions, reflecting the impressions confident-doubtful, allowable-unacceptable, and positive-negative, correspond to systematic prosodic variations: F0 height, F0 dynamics, and duration. Subjective evaluation of the synthesized speech demonstrated the possibility of language-independent communicative prosody generation from an input word impression vector.
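The Multi-Dimensional Scaling step can be illustrated with classical MDS: given a matrix of pairwise dissimilarities between words' subjective impressions, double centring and an eigendecomposition yield low-dimensional coordinates whose distances approximate the dissimilarities. The data below are synthetic, and classical MDS is an assumed variant; the paper does not specify which MDS algorithm was used.

```python
import numpy as np

def classical_mds(d, k=3):
    """Embed n points in k dimensions from an n-by-n dissimilarity matrix d
    via double centring and eigendecomposition of the Gram matrix."""
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    b = -0.5 * j @ (d ** 2) @ j           # Gram matrix
    evals, evecs = np.linalg.eigh(b)
    idx = np.argsort(evals)[::-1][:k]     # top-k eigenpairs
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))

# Synthetic pairwise dissimilarities between five words' impressions,
# generated from hidden 3-D points so the embedding is exactly recoverable.
rng = np.random.default_rng(1)
points = rng.normal(size=(5, 3))
d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)

coords = classical_mds(d, k=3)            # 3-D impression vectors, one per word
d_hat = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
```

Each recovered axis is then interpreted against listener judgements (here, confident-doubtful, allowable-unacceptable, positive-negative) and mapped to a prosodic control: F0 height, F0 dynamics, or duration.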