Interspeech 2007
August 27-31, 2007

Antwerp, Belgium

Interspeech 2007 Session ThB.P3a: Grapheme-to-phoneme conversion

Type poster
Date Thursday, August 30, 2007
Time 10:00 – 12:00
Room Keurvels
Chair Lori Lamel (CNRS-LIMSI/TLP)


Homograph Ambiguity Resolution in Front-End Design for Portuguese TTS Systems
Daniela Braga, Microsoft Language Development Center
Luís Coelho, Polytechnic Institute of Oporto
Fernando Gil Vianna Resende Jr., Federal University of Rio de Janeiro

In this paper, a module for homograph disambiguation in Portuguese Text-to-Speech (TTS) is proposed. This module works with a part-of-speech (POS) parser, used to disambiguate homographs that belong to different parts of speech, and a semantic analyzer, used to disambiguate homographs that belong to the same part of speech. The proposed algorithms are meant to solve a significant part of homograph ambiguity in European Portuguese (EP) (106 homograph pairs so far). The system is ready to be integrated into a Letter-to-Sound (LTS) converter. The algorithms were trained and tested on different corpora, and the experimental results show an accuracy rate of 97.8%. The methodology is also valid for Brazilian Portuguese (BP), since 95 of the homograph pairs are exactly the same as in EP. A comparison with a probabilistic approach was also carried out and the results are discussed.
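The POS-driven part of such a module can be sketched as a lookup keyed by the tag the parser assigns; the sketch below is a toy illustration, not the authors' code, and the word, tags, and SAMPA-like transcriptions are invented examples (Portuguese "colher" is pronounced differently as a verb, "to pick", and as a noun, "spoon"):

```python
# Toy POS-based homograph disambiguation (illustrative data, not the paper's lexicon).
HOMOGRAPHS = {
    # word -> {POS tag: SAMPA-like transcription}; transcriptions are approximate
    "colher": {"VERB": "ku'Ler", "NOUN": "ku'lEr"},
}

def disambiguate(word, pos_tag):
    """Return the transcription matching the POS tag for a known homograph."""
    entry = HOMOGRAPHS.get(word)
    if entry is None:
        return None  # not a homograph: fall through to the regular LTS rules
    return entry.get(pos_tag)
```

Homographs sharing a POS tag would fall through to the semantic analyzer instead, which this sketch does not model.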

New Word Acquisition Using Subword Modeling
Ghinwa Choueiter, Spoken Language Systems Group, MIT
Stephanie Seneff, Spoken Language Systems Group, MIT
James Glass, Spoken Language Systems Group, MIT

In this paper, we use subword modeling to learn the pronunciations and spellings of new words. The subwords are generated with a context-free grammar, and are intermediate units between phonemes and syllables. We first evaluate the effectiveness of the subword model in automatically generating the spelling and pronunciation of new words. Then the subword model is embedded in a multi-stage recognizer which consists of word, subword, and letter recognizers. In a preliminary set of experiments, the hybrid system outperforms a large-vocabulary isolated word recognizer. The subword model is also used to improve the performance of the letter recognizer by generating a spelling cohort which is used to train a small letter n-gram. The small letter n-gram has a reduced perplexity compared to a much larger n-gram, and can be used by the letter recognizer for the spoken spelling mode. This could translate to an improved letter error rate in future letter recognition experiments.
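Training a small letter n-gram from a spelling cohort can be sketched as plain bigram counting with word-boundary markers; the cohort below is invented for illustration and is not from the paper:

```python
from collections import defaultdict

def train_letter_bigram(cohort):
    """Count letter bigrams over a spelling cohort, with word-boundary markers."""
    counts = defaultdict(int)
    for word in cohort:
        letters = ["<s>"] + list(word) + ["</s>"]
        for a, b in zip(letters, letters[1:]):
            counts[(a, b)] += 1
    return counts

cohort = ["boston", "bolton", "boson"]  # hypothetical cohort of candidate spellings
model = train_letter_bigram(cohort)
```

Because the cohort is small and topical, such a model assigns mass to far fewer letter transitions than a general-purpose letter n-gram, which is the intuition behind the reduced perplexity reported above.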

Language Identification of Person Names using CF-IOF based Weighing Function
Samuel Thomas, IBM India Research Lab, New Delhi
Ashish Verma, IBM India Research Lab, New Delhi

Information about the language of origin helps in generating pronunciations for foreign words, especially person names, in a text-to-speech synthesis system: it can be used to apply language-specific letter-to-sound (LTS) rules to these words during synthesis. In this paper, we propose a novel approach that uses substrings of a person name (letter N-grams) to identify its language of origin. We weight the letter N-grams with a function motivated by techniques from text document classification, in contrast to the plain N-gram probabilities used in earlier approaches. We also propose a tree-based approach to select letter N-grams of different lengths for language identification. Several experiments were conducted to evaluate the proposed approach and compare it with earlier approaches based on N-gram probabilities. We show an improvement in classification results over those approaches without using any language-specific rules.
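A CF-IOF-style weighting can be sketched as the frequency of an N-gram within one language class, discounted by how often it occurs in the other classes. The exact formulation below, and the toy name lists, are assumptions for illustration only, not the paper's definition:

```python
import math
from collections import Counter

def cf_iof_weights(names_by_language, n=3):
    """Assumed CF-IOF-style weight: in-class frequency of a letter n-gram,
    scaled down the more often it appears in the other classes."""
    ngram_counts = {
        lang: Counter(name[i:i + n] for name in names for i in range(len(name) - n + 1))
        for lang, names in names_by_language.items()
    }
    weights = {}
    for lang, counts in ngram_counts.items():
        total = sum(counts.values())
        for gram, c in counts.items():
            other = sum(ngram_counts[l][gram] for l in ngram_counts if l != lang)
            weights[(lang, gram)] = (c / total) * math.log(1 + 1 / (1 + other))
    return weights

def classify(name, weights, langs, n=3):
    """Pick the language whose weighted n-grams best cover the name."""
    grams = [name[i:i + n] for i in range(len(name) - n + 1)]
    return max(langs, key=lambda lang: sum(weights.get((lang, g), 0.0) for g in grams))

names = {"en": ["smith", "smyth"], "de": ["schmidt"]}  # invented toy data
weights = cf_iof_weights(names)
```

N-grams that are distinctive of one language (large in-class count, near-zero elsewhere) get high weight, while n-grams shared across languages are discounted, which is the effect the document-classification-style weighting is after.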

G2P conversion of names. What can we do (better)?
Henk Van den Heuvel, CLST, Radboud University Nijmegen, The Netherlands
Jean-Pierre Martens, ELIS, Ghent University, Belgium
Nanneke Konings, CLST, Radboud University Nijmegen, The Netherlands

In this contribution it is shown that a good approach to the grapheme-to-phoneme conversion of proper names (e.g. person names, toponyms) is to use a cascade of a general-purpose grapheme-to-phoneme (G2P) converter and a special-purpose phoneme-to-phoneme (P2P) converter. The G2P produces an initial transcription that is then transformed by the P2P. The latter is automatically trained on reference transcriptions of names belonging to the envisaged name category (e.g. toponyms). The P2P learning process is conceived in such a way that it can take account of high-order determinants of pronunciation, such as specific syllables, name prefixes and name suffixes. The proposed methodology was successfully tested on person names and toponyms, but we believe that it will also substantially reduce the cost of building pronunciation lexicons for other name categories.
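The cascade can be sketched as two composed functions; everything below (the one-letter-per-phone G2P stand-in and the single rewrite rule) is invented for illustration and does not reflect the actual converters or learned rules:

```python
# Hypothetical two-stage cascade: a general-purpose G2P followed by a
# name-category-specific P2P correction step.

def general_g2p(name):
    """Stand-in for a general-purpose G2P: naive one-letter-one-phone mapping."""
    letter_to_phone = {"a": "a", "n": "n", "t": "t", "w": "w", "e": "e", "r": "r", "p": "p"}
    return [letter_to_phone.get(ch, ch) for ch in name.lower()]

P2P_RULES = [
    # (phoneme n-gram produced by the G2P, corrected n-gram) -- invented example
    (("t", "w"), ("t", "v")),
]

def p2p_correct(phones):
    """Apply learned-style P2P rewrite rules to the initial transcription."""
    phones = list(phones)
    for src, dst in P2P_RULES:
        i = 0
        while i + len(src) <= len(phones):
            if tuple(phones[i:i + len(src)]) == src:
                phones[i:i + len(src)] = list(dst)
            i += 1
    return phones

def name_g2p(name):
    return p2p_correct(general_g2p(name))
```

In the paper's setup the rewrite rules are learned from reference transcriptions of the target name category, so only the P2P stage needs retraining per category.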

A Learning Method for Thai Phonetization of English Words
Ausdang Thangthai, National Electronics and Computer Technology Center
Chai Wutiwiwatchai, National Electronics and Computer Technology Center
Anocha Ragchatjaroen, National Electronics and Computer Technology Center
Sittipong Saychum, National Electronics and Computer Technology Center

This article tackles the problem of transcribing English words using the Thai phonological system. The problem arises because modern Thai writing often incorporates English orthography, and transcribing it with English phonology sounds unnatural. The proposed model is entirely data-driven: it starts with automatic grapheme-phoneme alignment, then models transduction rules and predicts Thai syllable tones using machine learning. Three specific issues are addressed. First, English transcription information is incorporated into the transduction when the input English word appears in an English pronunciation dictionary. Second, more precise transduction rules are obtained by applying a Thai syllable-structure constraint. Lastly, the ambiguity in assigning tones to Thai pronunciations of English words is alleviated by introducing a learning machine. The proposed model achieves acceptable results in both objective tests and subjective text-to-speech synthesis tests.
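The syllable-structure constraint can be sketched as a filter that discards candidate transduction outputs violating a legal syllable pattern. The romanized phone classes and the pattern below are simplified placeholders, not actual Thai phonotactics:

```python
import re

# Toy constraint: initial consonant + vowel nucleus + optional final consonant.
# The character classes are invented stand-ins for real Thai phone classes.
SYLLABLE = re.compile(r"^[ptkmn][aiueo]+[ptkmn]?$")

def is_legal_syllable(sound):
    return bool(SYLLABLE.match(sound))

def constrain(candidates):
    """Keep only candidate syllables that satisfy the structure constraint."""
    return [c for c in candidates if is_legal_syllable(c)]
```

Pruning structurally illegal candidates before rule modeling is one plausible reading of how the constraint yields more precise transduction rules.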

Spontaneous Speech Synthesis by Pronunciation Variant Selection - A Comparison to Natural Speech
Steffen Werner, Dresden University of Technology
Rüdiger Hoffmann, Dresden University of Technology

In order to make synthetic speech more spontaneous we have introduced various duration control methods, which are based on word language model probability and on pronunciation variant selection algorithms. In this paper we combine the change of the speaking rate according to the language model probability with an indirect change of the speaking rate. The latter is achieved by a pronunciation variant selection algorithm based on a variant sequence model. To evaluate the quality of the different approaches and to compare them to the canonical synthesis, we performed various absolute category rating listening tests. In addition, we conducted the same test with natural speech to provide a further evaluation criterion. The results show that a suitable sequence of pronunciation variants achieves a significantly lower listening effort and a higher MOS for both synthetic and natural speech samples compared to the canonical ones.
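One way to read "selection based on a variant sequence model" is as a search over per-word variant choices scored by a sequence model; the sketch below does this exhaustively with invented variants and scores, which is an assumed formulation, not the authors' algorithm:

```python
from itertools import product

# Invented pronunciation variants and bigram scores for illustration only.
VARIANTS = {"guten": ["gutn", "guten"], "morgen": ["morgn", "morgen"]}
BIGRAM = {("gutn", "morgn"): 0.6, ("guten", "morgen"): 0.3}

def select_variants(words):
    """Pick the variant sequence with the best score under the sequence model."""
    best, best_score = None, float("-inf")
    for seq in product(*(VARIANTS[w] for w in words)):
        score = sum(BIGRAM.get(pair, 0.01) for pair in zip(seq, seq[1:]))
        if score > best_score:
            best, best_score = seq, score
    return list(best)
```

Reduced variants (here "gutn", "morgn") shorten the utterance, which is the indirect speaking-rate change the abstract refers to; a real system would use dynamic programming rather than exhaustive enumeration.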

A Generic Methodology Of Converting Transliterated Text To Phonetic Strings Case Study: Greeklish
Nikos Tsourakis, Dialogos Speech Communications S.A.
Vassilis Digalakis, Technical University of Crete

In this work, we present a generic methodology for converting transliterated text (native language written with a non-native alphabet) to phonetic sequences. The goal is to produce the same phonetic result as if a native speaker had uttered the original text in the native alphabet. We implemented this methodology as a front-end to a Text-to-Speech (TTS) server. To evaluate our algorithms, we considered the case of Greek written with the Latin alphabet, known as Greeklish.
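The first step of such a front-end can be sketched as a greedy longest-match transliteration back to the native alphabet, after which ordinary Greek letter-to-sound rules would apply. The mapping table below is a small invented subset, not the paper's rule set, and real Greeklish is far more ambiguous than this handles:

```python
# Toy Greeklish-to-Greek mapping; multi-letter rules are listed first so they
# win the longest-match comparison. Illustrative subset only.
GREEKLISH_MAP = [
    ("th", "θ"), ("ps", "ψ"), ("ks", "ξ"), ("ch", "χ"),
    ("a", "α"), ("b", "β"), ("g", "γ"), ("d", "δ"), ("e", "ε"),
    ("i", "ι"), ("k", "κ"), ("l", "λ"), ("m", "μ"), ("n", "ν"),
    ("o", "ο"), ("r", "ρ"), ("s", "σ"), ("t", "τ"),
]

def greeklish_to_greek(text):
    """Greedy longest-match transliteration back to the native alphabet."""
    out, i = [], 0
    while i < len(text):
        for latin, greek in GREEKLISH_MAP:
            if text.startswith(latin, i):
                out.append(greek)
                i += len(latin)
                break
        else:
            out.append(text[i])  # pass unmapped characters through
            i += 1
    return "".join(out)
```

Note the output is a naive letter-level transliteration (no accents, no i/η/υ disambiguation); resolving those ambiguities is exactly what makes the problem non-trivial.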

Probabilistic Deduction of Symbol Mappings for Extension of Lexicons
Rita Singh, Carnegie Mellon University
Evandro Gouvea, Carnegie Mellon University
Bhiksha Raj, Mitsubishi Electric Research Labs

This paper proposes a statistical mapping-based technique for guessing pronunciations of novel words from their spellings. The technique is based on the automatic determination and utilization of unidirectional mappings between n-tuples of characters and n-tuples of phonemes, and may be viewed as a statistical extension of analogy-based pronunciation guessing algorithms.
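The core of such a technique can be sketched as estimating which phoneme n-tuple most often corresponds to each character n-tuple in an aligned lexicon, then looking new spellings up greedily. The aligned fragments below are invented for illustration and do not come from the paper:

```python
from collections import defaultdict

# Invented aligned lexicon entries: (character n-tuple, phoneme n-tuple) pairs.
aligned = [
    [("ph", "f"), ("o", "oh"), ("ne", "n")],     # "phone"
    [("ph", "f"), ("o", "aa"), ("to", "t ow")],  # "photo"
]

counts = defaultdict(lambda: defaultdict(int))
for word in aligned:
    for chars, phones in word:
        counts[chars][phones] += 1

def most_likely(chars):
    """Most frequently mapped phoneme tuple for a character tuple, if seen."""
    options = counts.get(chars)
    if not options:
        return None
    return max(options, key=options.get)
```

A full pronunciation guess would segment a novel spelling into character n-tuples and concatenate the mapped phoneme tuples, scoring alternatives by their mapping probabilities.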
