Interspeech 2007 Session TuD.P2b: Prosodic modeling I
Tuesday, August 28, 2007
16:00 – 18:00
Ani Nenkova (Linguistics Department, Stanford University)
Modeling Incompletion Phenomenon in Mandarin Dialog Prosody
Yu Jian, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Lixing Huang, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Jianhua Tao, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Xia Wang, Nokia Research Centre, China
The paper proposes a prosody generation method for dialog speech synthesis in Mandarin. The method extends a prosody model for read speech and takes the essential characteristics of dialog speech into account. Beyond the faster speaking rate and narrower pitch range of dialog speech, our method concentrates on a more fundamental characteristic: the incompletion of the pitch contour within a syllable and its impact on adjacent syllables. To simulate this phenomenon, a CART-based method is constructed to predict whether a syllable is incomplete. Based on that prediction, a prosody generation model that focuses on the prosodic constraints between adjacent syllables is constructed; it can simulate the influence of an incomplete syllable on its neighbors. Experiments show that speech synthesized with this prosody model sounds much more natural and colloquial.
Stress Assignment Algorithm in Hungarian Based on Syntactic Analysis
Anne Tamm, Research Institute for Linguistics, Academy of Sciences, Hungary
Kálmán Abari, Department of Psychology, University of Debrecen, Hungary
Gábor Olaszy, Research Institute for Linguistics, Academy of Sciences, Hungary
This article presents the results of research aimed at developing an accent assignment system for Hungarian. Two methods that predict the stress distribution in sentences are compared: shallow and deep. The shallow method targets local and short-distance factors that determine accent; the deep (syntactic) method targets long-distance influences (such as focus). Neither method alone produces fully satisfactory output; frequently, however, their mistakes are complementary. The article presents the problems and solutions of both methods.
An Effective Initial/Final Duration Prediction Method for Corpus-based Singing Voice Synthesis of Mandarin Chinese
Cheng-Yuan Lin, Department of Computer Science, National Tsing Hua University, Taiwan
Pei-Chi Jao, Department of Computer Science, National Tsing Hua University, Taiwan
J.-S. Roger Jang, Department of Computer Science, National Tsing Hua University, Taiwan
In this paper, we propose an effective initial/final duration prediction method for corpus-based singing voice synthesis of Mandarin Chinese. The goal of the method is to improve the naturalness and clarity of synthetic singing voices. Within the framework of the method, we construct an individual initial/final (I/F) duration prediction model for each category of consonant. A support vector machine is used as the regression kernel of each model. To achieve better prediction accuracy, we use not only linguistic and phonetic attributes but also music-score information as input features for the I/F duration prediction model. Experimental results demonstrate that the proposed method is feasible and effective for I/F duration prediction in singing voice synthesis.
Increasing Prosodic Variability of Text-To-Speech Synthesizers
Géza Németh, Budapest University of Technology and Economics, Hungary
Márk Fék, Budapest University of Technology and Economics, Hungary
Tamás Gábor Csapó, Budapest University of Technology and Economics, Hungary
The lack of prosody variation in text-to-speech systems contributes to their perceived unnaturalness when synthesizing extended passages. In this paper, we present a method to improve prosody generation in this direction. A database of natural sample sentences is searched for sentences having similar word and syllable structure to the input. One sentence is selected randomly from the similar sentences found. The prosody of the randomly selected natural sentence is used as a target to generate the prosody of the synthetic one. An experiment was conducted to determine the potential of the proposed method. The rule-based pitch contour generation of a Hungarian concatenative synthesizer was replaced by a semi-automatic implementation of the proposed method. A listening test showed that subjects preferred sentences synthesized by the proposed method over a rule-based solution.
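The selection step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sentence representation (a list of words, each a list of syllables), the `structure` key, the database entry fields `words` and `f0`, and the use of exact structural matching as the similarity criterion are all assumptions made for the example.

```python
import random

def structure(words):
    """Hypothetical structure key: the syllable count of each word, in order."""
    return tuple(len(word) for word in words)

def pick_prosody_template(input_words, database, rng=random):
    """Randomly pick one database sentence whose structure matches the input.

    Returns None when no database sentence has the same structure.
    """
    key = structure(input_words)
    candidates = [entry for entry in database if structure(entry["words"]) == key]
    return rng.choice(candidates) if candidates else None

# Toy database: each entry stores its words and a made-up pitch target list.
database = [
    {"words": [["jó"], ["reg", "gelt"]], "f0": [120, 110, 95]},
    {"words": [["szép"], ["na", "pot"]], "f0": [130, 105, 90]},
    {"words": [["kö", "szö", "nöm"]], "f0": [125, 115, 100]},
]
input_words = [["nagy"], ["vá", "ros"]]  # one syllable, then two syllables
template = pick_prosody_template(input_words, database, random.Random(0))
```

Because the choice among matching sentences is random, repeated synthesis of structurally similar inputs can receive different prosodic targets, which is the source of the added variability.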
Unsupervised HMM classification of F0 curves
Damien Lolive, IRISA, Rennes, France
Nelly Barbot, IRISA, Rennes, France
Olivier Boeffard, IRISA, Rennes, France
This article describes a new unsupervised methodology for learning F0 classes using HMMs on a syllable basis. An F0 class is represented by an HMM with three emitting states. The clustering algorithm relies on an iterative process of Gaussian splitting and EM retraining. First, a single class is learnt on a training corpus (8000 syllables); it is then divided by perturbing the Gaussian means at successive levels. At each step, the mean RMS error is evaluated on a validation corpus (3000 syllables). The algorithm stops automatically when the error becomes stable or increases. The syllabic structure of a sentence is the reference level we have taken for F0 modelling, although the methodology can be applied to other structures. Clustering quality is evaluated by cross-validation, using the mean RMS error between F0 contours on a test corpus and the estimated HMM trajectories. The results show that the classes are of good quality (mean RMS error around 4 Hz).
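The evaluation measure and stopping rule mentioned in this abstract can be illustrated with a short sketch. The function names and the `tolerance` threshold are assumptions introduced for the example; the abstract does not say how "stable" is decided, and the HMM training itself is not shown.

```python
import math

def rms_error(contour_a, contour_b):
    """Root-mean-square difference between two equal-length F0 contours (Hz)."""
    if not contour_a or len(contour_a) != len(contour_b):
        raise ValueError("contours must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(contour_a, contour_b))
    return math.sqrt(squared / len(contour_a))

def mean_rms_error(contour_pairs):
    """Average RMS error over (observed, estimated) contour pairs."""
    return sum(rms_error(a, b) for a, b in contour_pairs) / len(contour_pairs)

def should_stop(previous_error, current_error, tolerance=0.05):
    """Stop splitting when the validation error is stable or has increased.

    `tolerance` (in Hz) is a hypothetical threshold for "stable".
    """
    return previous_error - current_error <= tolerance

observed = [100.0, 110.0]
estimated = [103.0, 114.0]
error = rms_error(observed, estimated)
```

In a splitting loop, `mean_rms_error` would be computed on the validation corpus after each split and retraining pass, and `should_stop` would compare it with the value from the previous pass.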
Automatic Pitch Accent Prediction for Text-To-Speech Synthesis
Ian Read, University of East Anglia
Stephen Cox, University of East Anglia
Determining pitch accents in a sentence is a key task for a text-to-speech (TTS) system. We describe some methods for pitch accent assignment which make use of features that contain information about a complete phrase or sentence, in contrast to most previous work which has focused on using features local to a syllable or word. Pitch accent prediction is performed using three different techniques: N-gram models of syllable sequences, dynamic programming to match sequences of features, and decision trees. Using a C4.5 decision tree trained on a wide range of features, most notably each word's orthographic form and information extracted from the syntactic parse of the sentence, our feature set achieved a balanced error rate of 46.6%. This compares with the feature set used in Sun, 2002, which had a balanced error rate of 55.55%.
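The abstract reports results as a balanced error rate but does not define it. A minimal sketch, assuming the common definition (the average of the per-class error rates, so that a frequent class such as "unaccented" does not dominate the score), is:

```python
def balanced_error_rate(true_labels, predicted_labels):
    """Average of the per-class error rates (assumed definition of BER)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have equal length")
    per_class_rates = []
    for label in sorted(set(true_labels)):
        # Indices of all items whose true class is `label`.
        indices = [i for i, t in enumerate(true_labels) if t == label]
        errors = sum(1 for i in indices if predicted_labels[i] != label)
        per_class_rates.append(errors / len(indices))
    return sum(per_class_rates) / len(per_class_rates)

# Toy example: 1 = accented, 0 = unaccented.
true_labels = [1, 1, 0, 0, 0, 0]
predicted_labels = [1, 0, 0, 0, 0, 1]
ber = balanced_error_rate(true_labels, predicted_labels)
```

Here the accented class has an error rate of 1/2 and the unaccented class 1/4, so the balanced error rate is their mean, 0.375, whereas the plain error rate would be 2/6.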
An Unsupervised Approach to Automatic Prosodic Annotation
Xinqiang Ni, Institute of Electronics, Chinese Academy of Sciences
Yining Chen, Microsoft Research Asia
Frank Soong, Microsoft Research Asia
Min Chu, Microsoft Research Asia
Ping Zhang, Institute of Electronics, Chinese Academy of Sciences
Accent is probably the most prominent of the prosodic events. Automatic accent labeling is important for both speech synthesis and automatic speech understanding. However, manually labeling data for traditional supervised learning is expensive and time-consuming. In this paper, we propose an unsupervised learning algorithm to label accent automatically. First, we assume that all content words are accented. We build an initial acoustic model from accented vowels in content words and high-confidence unaccented vowels in function words. An iterative process is then run until convergence. Experimental results show that this unsupervised learning algorithm achieves about 90% agreement on accent labeling. Compared with 84.3%, the accuracy of a typical linguistic classifier, a 30% relative error reduction is obtained.
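The relationship between the reported figures can be checked with a short calculation. This is only an arithmetic sketch of how a relative error reduction is usually computed; the function names are hypothetical, and the exact agreement figure is not given in the abstract.

```python
def accuracy_after_reduction(baseline_accuracy, relative_reduction):
    """Accuracy (%) implied by cutting the baseline error by a relative amount."""
    baseline_error = 100.0 - baseline_accuracy
    return 100.0 - baseline_error * (1.0 - relative_reduction)

def relative_error_reduction(baseline_accuracy, new_accuracy):
    """Fraction of the baseline error removed by the new system."""
    baseline_error = 100.0 - baseline_accuracy
    new_error = 100.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error

# Baseline accuracy of 84.3% with a 30% relative error reduction.
implied_accuracy = accuracy_after_reduction(84.3, 0.30)
```

With a baseline error of 15.7%, a 30% relative reduction leaves an error of about 11%, which implies an agreement of roughly 89% and is consistent with the "about 90%" stated in the abstract.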
A System for Transforming the Emotion in Speech: Combining Data-Driven Conversion Techniques for Prosody and Voice Quality
Zeynep Inanoglu, Cambridge University
Steve Young, Cambridge University
This paper describes a system that combines independent transformation techniques to endow a neutral utterance with a required target emotion. The system consists of three modules, each trained on a limited amount of speech data and acting on a different temporal layer. F0 contours are modelled and generated using context-sensitive syllable HMMs, while durations are transformed using phone-based relative decision trees. For spectral conversion, which is applied at the segmental level, two methods were investigated: a GMM-based voice conversion approach and a codebook selection approach. Converted test data were evaluated for three emotions using an independent emotion classifier as well as perceptual listening tests. The listening test results show that the perception of sadness in our system's output was comparable to that of human sad speech, while the perception of surprise and anger was around 5% worse than for a human speaker.
An Automatic Prosody Labeling Method for Mandarin Speech
Chen Yu Chiang, Dept. of Communication Engineering, National Chiao Tung University, Taiwan
Hsiu Min Yu, Dept. of Foreign Languages and Literature, Chung Hua University, Taiwan
Yih Ru Wang, Dept. of Communication Engineering, National Chiao Tung University, Taiwan
Sin Horng Chen, Dept. of Communication Engineering, National Chiao Tung University, Taiwan
A new model-based automatic prosody labeling method for Mandarin speech is proposed. It first introduces four models to describe the relationships of the prosody tags to be labeled, the prosodic features of the speech signals, and the linguistic features of the associated texts. It then employs a sequential optimization procedure to estimate parameters of these four models and find all prosody tags. Experimental results on the Sinica Tree-Bank corpus showed that most prosody tags labeled were meaningful and the estimated parameters of these four models matched well with our a priori knowledge about Mandarin prosody.