Interspeech 2007 Session FrC.P2: Speech synthesis II
Friday, August 31, 2007
13:30 – 15:30
Volker Strom (CSTR, University of Edinburgh)
Implementation and Evaluation of an HMM-based Thai Speech Synthesis System
Suphattharachai Chomphan, Tokyo Institute of Technology
Takao Kobayashi, Tokyo Institute of Technology
This paper describes a novel approach to the realization of Thai speech synthesis. Spectrum, pitch, and phone duration are modeled simultaneously in a unified HMM framework, and their parameter distributions are clustered independently using a decision-tree based context clustering technique with several clustering styles. A group of contextual factors that affect spectrum, pitch, and state duration, e.g., tone type and part of speech, are taken into account, as is particularly important for a tonal language. Evaluation of the synthesized speech shows that tone correctness is significantly improved for some clustering styles; moreover, the implemented system reproduces prosody (and hence, in some sense, naturalness) better than a unit-selection-based system built on the same speech database.
Speech Synthesis enhancement in noisy environments
Davide Bonardo, Loquendo S.p.A.
Enrico Zovato, Loquendo S.p.A.
This paper reports recent work on improving the intelligibility of synthesized speech in noisy environments. Text-To-Speech (TTS) technologies are now used in many embedded devices such as mobile phones, PDAs, and car navigation systems. This means that speech may be produced in environments where background noise can significantly degrade the perception of the synthetic message and consequently its intelligibility. The features discussed in this paper are being developed and assessed within the EU-funded SHARE project, whose goal is to develop a multimodal communication system supporting rescue operations and disaster management.
Tagging Syllable Boundaries With Hidden Markov Models
Helmut Schmid, Institute of Natural Language Processing (IMS), University of Stuttgart, Germany
Bernd Möbius, Institute of Natural Language Processing (IMS), University of Stuttgart, Germany
Julia Weidenkaff, Institute of Natural Language Processing (IMS), University of Stuttgart, Germany
This paper presents a statistical method, based on a Hidden Markov model, for segmenting words into syllables. Our system assigns syllable boundaries to phonetically transcribed words, formulating syllabification as a tagging task. The syllable tagger was trained on syllable-annotated phone sequences. In an evaluation using ten-fold cross-validation, the system correctly predicted the syllabification of German words with a word accuracy of 99.85%, clearly exceeding results previously reported in the literature. The best performance was observed for a context size of five preceding phones. A detailed qualitative error analysis suggests that a further reduction of the error rate by up to 90% is possible by eliminating inconsistencies in the training database.
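As an illustration (not from the paper), the tagging formulation can be sketched as a two-state HMM: each phone receives a tag "B" (syllable-initial) or "I" (syllable-internal), and Viterbi decoding picks the best tag sequence. All probabilities below are invented toy values; the paper's trained model also conditions on up to five preceding phones.

```python
import math

# Toy HMM syllabifier: tags are "B" (opens a syllable) or "I" (internal).
# Transition and emission probabilities are illustrative, not trained.
STATES = ("B", "I")
TRANS = {("B", "B"): 0.2, ("B", "I"): 0.8,    # P(tag_t | tag_{t-1})
         ("I", "B"): 0.45, ("I", "I"): 0.55}
VOWELS = set("aeiou")

def emit(tag, phone):
    # toy emission model: vowels rarely open a syllable in this sketch
    if phone in VOWELS:
        return 0.15 if tag == "B" else 0.85
    return 0.65 if tag == "B" else 0.35

def viterbi(phones):
    # the first phone of a word always opens a syllable
    score = {"B": math.log(emit("B", phones[0])), "I": float("-inf")}
    backptrs = []
    for ph in phones[1:]:
        new_score, ptr = {}, {}
        for t in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[(p, t)]))
            new_score[t] = (score[prev] + math.log(TRANS[(prev, t)])
                            + math.log(emit(t, ph)))
            ptr[t] = prev
        score, backptrs = new_score, backptrs + [ptr]
    tag = max(STATES, key=score.get)
    tags = [tag]
    for ptr in reversed(backptrs):
        tag = ptr[tag]
        tags.append(tag)
    tags.reverse()
    return tags

phones = list("bandana")
tags = viterbi(phones)
# print the word with "." before each predicted syllable-initial phone
print("".join(("." if t == "B" and i else "") + p
              for i, (p, t) in enumerate(zip(phones, tags))))
```

A real system would train these distributions on syllable-annotated phone sequences, as the paper does.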
Hierarchical Non-uniform Unit Selection Based on Prosodic Structure
Jun Xu, Department of Computer Science and Technology, Tsinghua University, Beijing, China
Dezhi Huang, Speech and Natural Language Processing Unit, France Telecom R&D Beijing, China
Yongxin Wang, Department of Computer Science and Technology, Tsinghua University, Beijing, China
Lianhong Cai, Department of Computer Science and Technology, Tsinghua University, Beijing, China
In speech synthesis systems based on waveform concatenation, using longer waveform segments can generate more natural synthetic speech. To make use of whatever longer segments are available in our corpus, this paper introduces a hierarchical non-uniform unit selection framework. Each layer of the framework is an independent search procedure that searches for units of a particular size and adopts naturalness-measuring functions suited to that unit type. We have applied the framework to our Mandarin speech synthesis system, following the Chinese prosodic structure and the statistics of the prosodic units in the corpus. Experimental results show that it outperforms our previous system.
Control of an Articulatory Speech Synthesizer based on Dynamic Approximation of Spatial Articulatory Targets
Peter Birkholz, Institute for Computer Science, University of Rostock
We present a novel approach to the generation of speech movements for an articulatory speech synthesizer. The movements of the articulators are modeled by third-order linear dynamical systems that respond to sequences of simple motor commands. The motor commands are derived automatically from a high-level schedule for the input phonemes. The proposed model considers velocity differences between articulators and accounts for coarticulation between vowels and consonants. Preliminary tests of the model in the framework of an articulatory speech synthesizer indicate its potential to produce realistic speech movements and thereby contribute to higher-quality synthesized speech.
A Preselection Method Based on Cost Degradation from the Optimal Sequence for Concatenative Speech Synthesis
Nobuyuki Nishizawa, KDDI R&D Laboratories Inc.
Hisashi Kawai, KDDI R&D Laboratories Inc.
A novel unit preselection criterion for concatenative speech synthesis is proposed. To reduce the computational cost of unit selection, units that are unlikely to be selected should be pruned in a preselection step before the Viterbi search. Since the criterion is defined as the difference between the cost of the locally optimal sequence in which a given unit is fixed and that of the globally optimal sequence, not only the target cost but also the concatenation cost can be taken into account during preselection. For real-time speech synthesis, a preselection method using decision trees, in which a unit may be bound to multiple nodes of a tree, is also introduced. Results of a unit selection experiment show that the proposed method, using decision trees built from 8 hours of training data, is superior in terms of the costs of the selected units to conventional online preselection based on target costs. The experimental results also show that the method is most effective when the computational budget is tightly limited.
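The cost-degradation criterion can be illustrated on a toy lattice (all costs invented, not from the paper): for each candidate unit, compute the cost of the best path constrained to pass through that unit, subtract the cost of the globally best path, and prune units whose degradation exceeds a beam.

```python
import itertools

TARGET = [          # TARGET[t][i]: target cost of candidate i at position t
    [1.0, 2.0],
    [0.5, 1.4],
    [2.0, 0.2],
]

def concat_cost(i, j):
    # toy concatenation cost: joining different candidate indices costs 0.5
    return 0.0 if i == j else 0.5

def best_path_cost(fixed=None):
    # exhaustive search over the tiny lattice; `fixed` = (position, candidate)
    # constrains the path to pass through that unit
    best = float("inf")
    for seq in itertools.product(range(2), repeat=len(TARGET)):
        if fixed is not None and seq[fixed[0]] != fixed[1]:
            continue
        cost = sum(TARGET[t][i] for t, i in enumerate(seq))
        cost += sum(concat_cost(a, b) for a, b in zip(seq, seq[1:]))
        best = min(best, cost)
    return best

global_best = best_path_cost()
beam = 1.0
kept = [(t, i) for t in range(len(TARGET)) for i in range(2)
        if best_path_cost(fixed=(t, i)) - global_best <= beam]
print(global_best, kept)
```

Because the constrained path cost includes concatenation costs on both sides of the fixed unit, this prunes units that a target-cost-only preselection would keep.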
Line Cepstral Quefrencies and Their Use for Acoustic Inventory Coding
Guntram Strecha, Technische Universitaet Dresden
Matthias Eichner, Technische Universitaet Dresden
Ruediger Hoffmann, Technische Universitaet Dresden
Line spectral frequencies (LSF) are widely used in the field of speech coding. Due to their properties, LSFs are well suited to the quantisation and efficient compression of speech signals. In this paper we introduce line cepstral quefrencies (LCQ), which are derived from the cepstrum in the same manner as LSFs are derived from linear predictive coding (LPC) features. We show that combining the pole-zero transfer function of the cepstrum with the properties of LSFs offers advantages for speech coding. We apply the LCQ features to compress an acoustic inventory used for low-resource speech synthesis. The compression performance of the LCQ features is shown to be better than that of the LSF features in terms of mean spectral distance to the original inventory.
Articulatory Acoustic Feature Applications in Speech Synthesis
Peter Cahill, UCD
Daniel Aioanei, UCD
Julie Carson-Berndsen, UCD
The quality of unit selection speech synthesisers depends significantly on the content of the speech database used. This paper introduces a technique that can highlight mispronunciations and abnormal units in a speech synthesis voice database by using articulatory acoustic feature extraction to obtain an additional layer of annotation. A set of articulatory acoustic feature classifiers helps minimise the selection of inappropriate units from the speech database and is shown to significantly reduce the word error rate of a diphone synthesiser.
Approaches for adaptive database reduction for Text-To-Speech synthesis
Aleksandra Krul, France Télécom
Géraldine Damnati, France Télécom
François Yvon, GET/ENST and CNRS/LTCI
Cédric Boidin, France Télécom
Thierry Moudenc, France Télécom
This paper addresses the reduction of a speech database adapted to a specific domain for Text-To-Speech (TTS) synthesis. We evaluate several methods: a database pruning technique based on the statistical behaviour of the unit selection algorithm, and a database adaptation method based on the Kullback-Leibler divergence. The aim of the former is to eliminate the units least often selected during synthesis of a domain-specific training corpus. The aim of the latter is to build a reduced database whose unit distribution approximates a given target distribution. We evaluate these methods using several objective measures.
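As a hypothetical illustration of the divergence-based adaptation idea (the unit labels, counts, and target distribution below are invented, and the paper's actual algorithm may differ): greedily remove units from the database so that the unit-type distribution of what remains approximates a target domain distribution.

```python
import math
from collections import Counter

def kl(p, q, eps=1e-9):
    # D(p || q), smoothed so that types absent from q do not produce log(0)
    return sum(pv * math.log(pv / (q.get(k, 0.0) + eps))
               for k, pv in p.items() if pv > 0)

def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

target = {"a": 0.5, "b": 0.3, "c": 0.2}     # desired domain distribution
db = Counter({"a": 10, "b": 10, "c": 10})   # full database: 30 units, uniform

initial_div = kl(target, normalize(db))
budget = 18                                  # desired reduced-database size
while sum(db.values()) > budget:
    # remove the single unit whose removal lowers D(target || db) the most,
    # never deleting the last instance of a unit type
    best_key = min((k for k, v in db.items() if v > 1),
                   key=lambda k: kl(target, normalize(db - Counter([k]))))
    db -= Counter([best_key])

final_div = kl(target, normalize(db))
print(dict(db), round(final_div, 4))
```

The greedy loop steadily shifts the retained unit distribution toward the target, which is the stated aim of the adaptation method.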
Exploiting Unlabeled Internal Data in Conditional Random Fields to Reduce Word Segmentation Errors for Chinese Texts
Richard Tzong-Han Tsai, Institute of Information Science, Academia Sinica
Hsi-Chuan Hung, Institute of Information Science, Academia Sinica
Hong-Jie Dai, Institute of Information Science, Academia Sinica
Wen-Lian Hsu, Institute of Information Science, Academia Sinica
Text-to-speech (TTS) conversion is finding ever wider application. TTS is particularly difficult in Chinese because there are no delimiters between words, so Chinese word segmentation (CWS) is required as its first key step. However, due to ambiguous word boundaries in Chinese, CWS systems may generate serious segmentation errors, causing incorrect interpretation of sentences, which in turn leads to TTS errors and hinders broader use. Our method exploits only unlabeled internal data to reduce segmentation errors. To demonstrate its generality, we verify our system on the most credible CWS evaluation, the recent SIGHAN bakeoff. Experimental results show that, with only the training data and unlabeled test data, our approach reduces segmentation errors by 15% on average. Our system also achieves performance comparable to the best CWS system that uses external resources. Further analysis shows that our approach has the potential to become more accurate as the test data size increases.
On the role of spectral dynamics in unit selection speech synthesis
Barry Kirkpatrick, Dublin City University
Darragh O'Brien, Dublin City University
Ronan Scaife, Dublin City University
Andrew Errity, Dublin City University
Cost functions employed in unit selection significantly influence the quality of the speech output. Although unit selection can produce very natural-sounding speech, the quality can be inconsistent and is difficult to guarantee due to discontinuities between incompatible units. The join cost employed in unit selection to measure the suitability of concatenating two speech units typically consists of sub-costs representing the fundamental frequency and the spectrum at the boundaries of each unit. In this study the role of spectral dynamics as a join cost in unit selection synthesis is explored. A number of spectral dynamics measures are tested on the task of detecting discontinuities. Results indicate that spectral dynamics measures correlate with human perception of discontinuity when the features are extracted appropriately. Spectral dynamics mismatch is found to be a source of discontinuity, although the results suggest it is likely to occur simultaneously with static spectral mismatch.
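A minimal sketch of a join cost that penalises spectral-dynamics mismatch alongside static mismatch (the vectors stand in for spectral frames such as MFCCs; the weights and numbers are illustrative, not the paper's measures):

```python
import math

def delta(frames):
    # first-order difference over the last two frames, a simple stand-in
    # for regression-based delta features
    return [b - a for a, b in zip(frames[-2], frames[-1])]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def join_cost(left_unit, right_unit, w_static=1.0, w_delta=1.0):
    # static mismatch: last frame of the left unit vs first frame of the right
    static = euclidean(left_unit[-1], right_unit[0])
    # dynamic mismatch: spectral slope leaving the left unit vs entering the right
    d_left = delta(left_unit)
    d_right = [b - a for a, b in zip(right_unit[0], right_unit[1])]
    dynamic = euclidean(d_left, d_right)
    return w_static * static + w_delta * dynamic

# two toy units whose edge frames match statically but not dynamically:
left = [[0.0, 0.0], [1.0, 1.0]]    # rising trajectory
right = [[1.0, 1.0], [0.0, 0.0]]   # falling trajectory
print(join_cost(left, right))
```

Here the static sub-cost alone would rate the join as perfect, while the dynamics term exposes the trajectory reversal, matching the study's observation that dynamic and static mismatch carry complementary information.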
uGloss: A Framework for Improving Spoken Language Generation Understandability
Brian Langner, CMU
Alan W Black, CMU
Understandable spoken presentation of structured and complex information is a difficult task to do well. As speech synthesis is used in more applications, there is likely to be an increasing requirement to present complex information in an understandable manner. This paper introduces uGloss, a language generation framework designed to improve the understandability of spoken output. We describe the factors relevant to its design and give a general description of our algorithm. We compare our approach to human performance on a straightforward task, and discuss areas for improvement and our future goals for this work.
Combination of LSF and Pole Based Parameter Interpolation for Model Based Diphone Concatenation
Karl Schnell, Institute of Applied Physics, Goethe-University Frankfurt, Max-von-Laue-Str. 1, D-60438 Frankfurt am Main, Germany
Arild Lacroix, Institute of Applied Physics, Goethe-University Frankfurt, Max-von-Laue-Str. 1, D-60438 Frankfurt am Main, Germany
For speech generation using small databases, spectral smoothing at the unit joints is necessary and can be realized by interpolating model parameters. For that purpose, LSFs are the best choice among the conventional parameter representations. This contribution shows how LSF interpolation can be improved by using poles as parameters. The problem of assigning poles between the two pole configurations at a unit joint is solved by tracking the poles of an LSF transition. Inspection of the assignments determined by LSF transitions reveals unfavorable cases, which can be corrected. A comparison between the LSF-based and pole-based interpolations shows that LSF interpolation can be improved by the corrected pole assignments and by the pole trajectories. The investigations are performed on a diphone database analyzed with an extended LPC model in lattice structure that includes vocal tract losses.
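The baseline LSF interpolation can be sketched as follows (the LSF values are illustrative, not from real speech). One reason LSFs suit interpolation is that the ordering 0 < l1 < l2 < ... < pi is preserved by any convex combination of two valid LSF vectors:

```python
def interpolate_lsf(lsf_a, lsf_b, n_frames):
    """Return n_frames LSF vectors linearly fading from lsf_a to lsf_b."""
    out = []
    for i in range(n_frames):
        w = i / (n_frames - 1)  # crossfade weight, 0 .. 1
        out.append([(1 - w) * a + w * b for a, b in zip(lsf_a, lsf_b)])
    return out

lsf_end = [0.3, 0.9, 1.5, 2.1]    # toy LSFs at the end of unit 1 (radians)
lsf_start = [0.5, 0.8, 1.7, 2.4]  # toy LSFs at the start of unit 2
for frame in interpolate_lsf(lsf_end, lsf_start, 5):
    # the LSF ordering property guarantees each blended frame is still valid
    assert all(x < y for x, y in zip(frame, frame[1:]))
    print([round(x, 3) for x in frame])
```

The paper's contribution is to go beyond this: map each LSF transition to pole trajectories, correct unfavorable pole assignments, and interpolate the poles instead.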
Automatic Building of Synthetic Voices from Large Multi-Paragraph Speech Databases
Kishore Prahallad, LTI, Carnegie Mellon University
Arthur Toth, LTI, Carnegie Mellon University
Alan Black, LTI, Carnegie Mellon University
Large multi-paragraph speech databases encapsulate prosodic and contextual information beyond the sentence level that can be exploited to build natural-sounding voices. This paper discusses our efforts toward automatically building synthetic voices from large multi-paragraph speech databases. We show that the primary issue, the segmentation of large speech files, can be addressed with modifications to the forced-alignment technique, and that the proposed technique is independent of the duration of the audio file. We also discuss how this framework can be extended to build a large number of voices from large multi-paragraph public-domain recordings.
Automatic Phonetic Segmentation of Spanish Emotional Speech
Ascensión Gallardo-Antolín, Universidad Carlos III de Madrid
Roberto Barra, Universidad Politécnica de Madrid
Marc Schröder, DFKI GmbH
Sacha Krstulovic, DFKI GmbH
Juan M. Montero, Universidad Politécnica de Madrid
Unit selection is the state-of-the-art technique for achieving high-quality synthetic emotional speech. It nevertheless requires expensive phonetic segmentation of a large corpus, so cost-effective automatic techniques should be studied. The HMM experiments in this paper show that segmentation performance can depend heavily on the segmental or prosodic nature of the intended emotion (segmental emotions are more difficult to segment than prosodic ones); that several emotions should be combined to obtain a larger training set when prosodic emotions are involved (this is especially true for small training sets); and that combining emphatic and non-emphatic emotional recordings (short sentences vs. long paragraphs) can degrade overall performance.
Iterative Unit Selection with Unnatural Prosody Detection
Dacheng Lin, Institute of Computing Technology, Chinese Academy of Sciences
Yong Zhao, Speech Group, Microsoft Research Asia
Frank K. Soong, Speech Group, Microsoft Research Asia
Min Chu, Speech Group, Microsoft Research Asia
Jieyu Zhao, Speech Group, Microsoft Research Asia
Corpus-driven speech synthesis is hampered by occasional glitches that ruin the impression of the whole utterance. We propose iterative unit selection integrated with an unnatural prosody detection model to identify any unnatural prosody. The system searches for an optimal path in the lattice, verifies its naturalness with the unnatural prosody model, and replaces any bad section with a better candidate until the path passes the verification test. In the light of hypothesis testing, we show that this trial-and-error approach takes effective advantage of the abundant candidate samples in the database. Moreover, in contrast to conventional prosody prediction, an unnatural prosody detection model still leaves enough room for prosodic variation. Unnaturalness confidence measures are studied. The combined model reduces the objective distortion by 16.3%, and perceptual experiments confirm that the proposed approach appreciably improves synthetic speech quality.
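The trial-and-error loop can be sketched as follows (the per-slot search, the F0-jump detector, and all numbers are toy stand-ins, not the paper's lattice search or confidence measures): pick the best candidate sequence, run the unnaturalness check, and if a unit fails, exclude it and search again.

```python
def search(candidates, excluded):
    # stand-in for lattice search: per slot, cheapest non-excluded candidate
    path = []
    for slot, units in enumerate(candidates):
        pool = [u for u in units if (slot, u["id"]) not in excluded]
        path.append((slot, min(pool, key=lambda u: u["cost"])))
    return path

def detect_unnatural(path, f0_jump_limit=60.0):
    # toy criterion: flag adjacent units whose F0 gap exceeds the limit
    for (s1, u1), (s2, u2) in zip(path, path[1:]):
        if abs(u1["f0"] - u2["f0"]) > f0_jump_limit:
            return (s2, u2["id"])  # blame the later unit
    return None

candidates = [
    [{"id": 0, "cost": 1.0, "f0": 120.0}],
    [{"id": 0, "cost": 0.5, "f0": 200.0},   # cheap but prosodically off
     {"id": 1, "cost": 0.8, "f0": 130.0}],
    [{"id": 0, "cost": 0.7, "f0": 125.0}],
]

excluded = set()
path = search(candidates, excluded)
while (bad := detect_unnatural(path)) is not None:
    excluded.add(bad)           # replace the bad section on the next pass
    path = search(candidates, excluded)
print([u["id"] for _, u in path])
```

The detector only vetoes clearly unnatural joins rather than predicting one target contour, which is how the approach leaves room for prosodic variation.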