Interspeech 2007 Session ThC.P1b: Voice conversion and modification
Type: poster
Date: Thursday, August 30, 2007
Time: 13:30 – 15:30
Room: Foyer
Chair: Frank Soong (Microsoft Research Asia)
ThC.P1b‑1
F0 Transformation within the Voice Conversion Framework
Zdenek Hanzlicek, Department of Cybernetics, University of West Bohemia, Pilsen, Czech Republic
Jindrich Matousek, Department of Cybernetics, University of West Bohemia, Pilsen, Czech Republic
In this paper, several experiments on F0 transformation within the voice conversion framework are presented. The conversion system is based on a probabilistic transformation of line spectral frequencies and residual prediction. Three probabilistic methods of instantaneous F0 transformation are described and compared. Moreover, a new modification of inter-speaker residual prediction is proposed which uses information on the target F0 directly during the selection of a suitable residual. Preference listening tests confirmed that this modification outperformed the standard version of residual prediction.
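For orientation, the simplest probabilistic F0 mapping in this family is a mean-variance transformation in the log-F0 domain. The sketch below illustrates only that common baseline, not the paper's three methods or its residual-prediction modification; the function names and interface are illustrative assumptions.

```python
import numpy as np

def train_logf0_stats(f0_values):
    """Mean and standard deviation of log-F0 over voiced frames (f0 > 0)."""
    f0 = np.asarray(f0_values, dtype=float)
    log_f0 = np.log(f0[f0 > 0])
    return log_f0.mean(), log_f0.std()

def convert_f0(f0_source, src_stats, tgt_stats):
    """Map source F0 to the target speaker by mean/variance normalization
    in the log-F0 domain; unvoiced frames (f0 == 0) are passed through."""
    mu_s, sigma_s = src_stats
    mu_t, sigma_t = tgt_stats
    f0 = np.asarray(f0_source, dtype=float)
    out = np.zeros_like(f0)
    voiced = f0 > 0
    out[voiced] = np.exp(mu_t + (np.log(f0[voiced]) - mu_s) * sigma_t / sigma_s)
    return out
```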
ThC.P1b‑2
Weighted Frequency Warping for Voice Conversion
Daniel Erro, Universitat Politècnica de Catalunya (UPC)
Asunción Moreno, Universitat Politècnica de Catalunya (UPC)
This paper presents a new voice conversion method called Weighted Frequency Warping (WFW), which combines the well-known GMM approach and the frequency warping approach. The harmonic plus stochastic model has been used to analyze, modify and synthesize the speech signal. Special phase manipulation procedures have been designed to allow the system to work in pitch-asynchronous mode. The experiments show that the proposed technique reaches a high degree of similarity between the converted and target speakers, and that the naturalness and quality of the resynthesized speech are much higher than those of classical GMM-based systems.
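The core operation combined with the GMM here is a frequency warping of the spectral envelope. The sketch below shows a generic piecewise-linear warp blended across mixture components by GMM posteriors; the anchor-point interface and function names are assumptions for illustration, not the WFW formulation itself.

```python
import numpy as np

def warp_envelope(envelope, freqs, anchors_src, anchors_tgt):
    """Apply a piecewise-linear frequency warping to a magnitude envelope.
    anchors_src / anchors_tgt: matching frequency anchor points (Hz), both increasing."""
    # Frequency each output bin should be read from in the source envelope.
    warped_freqs = np.interp(freqs, anchors_tgt, anchors_src)
    return np.interp(warped_freqs, freqs, envelope)

def weighted_frequency_warp(envelope, freqs, class_warps, posteriors):
    """Blend per-class warped envelopes using GMM posterior weights
    (hypothetical interface; posteriors are assumed to sum to one)."""
    out = np.zeros_like(envelope)
    for (anchors_src, anchors_tgt), w in zip(class_warps, posteriors):
        out += w * warp_envelope(envelope, freqs, anchors_src, anchors_tgt)
    return out
```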
ThC.P1b‑3
Frame Alignment Method for Cross-lingual Voice Conversion
Daniel Erro, Universitat Politècnica de Catalunya (UPC)
Asunción Moreno, Universitat Politècnica de Catalunya (UPC)
Most of the existing voice conversion methods calculate the optimal transformation function from a given set of paired acoustic vectors of the source and target speakers. The alignment of the phonetically equivalent source and target frames is problematic when the training corpus available is not parallel, although this is the most realistic situation. The alignment task is even more difficult in cross-lingual applications because the phoneme sets may be different in the involved languages. In this paper, a new iterative alignment method based on acoustic distances is proposed. The method is shown to be suitable for text-independent and cross-lingual voice conversion, and the conversion scores obtained in our evaluation experiments are not far from the performance achieved by using parallel training corpora.
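A common way to realize such an iterative, distance-based alignment is to pair each (converted) source frame with its acoustically nearest target frame, re-estimate a rough mapping from those pairs, and repeat. The sketch below follows that generic scheme; the global-bias update stands in for re-training the conversion function and is an assumption, not the paper's exact procedure.

```python
import numpy as np

def iterative_alignment(src_frames, tgt_frames, n_iters=5):
    """Iteratively pair source frames with acoustically nearest target frames.
    src_frames: (N, D) and tgt_frames: (M, D) spectral vectors.
    Returns, for every source frame, the index of its paired target frame."""
    converted = src_frames.copy()
    pairs = None
    for _ in range(n_iters):
        # Euclidean distance from each (converted) source frame to every target frame.
        d = np.linalg.norm(converted[:, None, :] - tgt_frames[None, :, :], axis=-1)
        pairs = d.argmin(axis=1)
        # Crude mapping update from the current pairing: a global mean shift.
        # A real system would re-train the GMM conversion function here instead.
        bias = (tgt_frames[pairs] - src_frames).mean(axis=0)
        converted = src_frames + bias
    return pairs
```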
ThC.P1b‑4
Voicing Level Control with Application in Voice Conversion
Jani Nurminen, Nokia Technology Platforms
Jilei Tian, Nokia Research Center
Victor Popa, Nokia Research Center
Speech-processing-related changes in the speech spectra may often lead to unwanted changes in the effective degree of voicing, which in turn may degrade the speech quality. This phenomenon is studied more closely in this paper, first on a theoretical level and then in the context of voice conversion. Moreover, a simple but efficient approach for avoiding the unwanted changes in the effective level of voicing is proposed. The usefulness of the proposed voicing level control is demonstrated in a practical voice conversion system. The compensation of the changes in the degree of voicing is found to reduce the average level of noise in the output and to enhance the perceptual speech quality.
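To make the idea concrete, the sketch below measures the degree of voicing of a frame as a harmonic-to-total energy ratio and rescales the noise component so that the ratio matches a desired level. Both the measure and the compensation rule are illustrative assumptions, not the authors' control scheme, and a harmonic+noise decomposition of each frame is assumed to be available.

```python
import numpy as np

def voicing_ratio(harmonic, noise):
    """Degree of voicing as harmonic-to-total energy ratio (one of many possible measures)."""
    eh = float(np.sum(harmonic ** 2))
    en = float(np.sum(noise ** 2))
    return eh / (eh + en + 1e-12)

def control_voicing(harmonic, noise, target_ratio):
    """Rescale the noise component so the frame's voicing ratio matches target_ratio."""
    eh = float(np.sum(harmonic ** 2))
    en = float(np.sum(noise ** 2))
    if eh <= 0.0 or target_ratio <= 0.0 or target_ratio >= 1.0:
        return harmonic + noise  # fully unvoiced/voiced targets: leave the frame as it is
    desired_en = eh * (1.0 - target_ratio) / target_ratio
    gain = np.sqrt(desired_en / (en + 1e-12))
    return harmonic + gain * noise
```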
ThC.P1b‑5
New Algorithm for LPC Residual Estimation from LSF Vectors for a Voice Conversion System
Winston Percybrooks, Georgia Institute of Technology, Savannah, Georgia, USA
Elliot Moore, Georgia Institute of Technology, Savannah, Georgia, USA
Voice conversion involves transforming segments of speech from a source speaker so that they are perceived as if spoken by a target speaker. Generally, this process involves the estimation of vocal tract parameters and an excitation signal that match the target speaker. The work presented here proposes an algorithm for estimating the excitation residuals of the target speaker using a weighted combination of clustered residuals. The algorithm is compared, objectively and subjectively, with other basic residual estimation techniques for voice conversion. Tests were carried out on two male and two female target speakers in an ideal setting. The overall goal of this work is an improved algorithm for estimating excitation residuals during voice conversion that maintains speaker recognizability and high synthesis quality.
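One simple instance of predicting a residual from LSF vectors via clustered prototypes is sketched below: the frame's LSF vector is compared with cluster centroids and the prototype residuals are mixed with distance-based weights. The interface (centroids, prototype residuals, exponential weighting) is an assumption for illustration, not the authors' algorithm.

```python
import numpy as np

def predict_residual(lsf, centroids, centroid_residuals, beta=1.0):
    """Estimate a target excitation residual as a distance-weighted combination
    of per-cluster prototype residuals (illustrative interface).
    lsf: (D,) line spectral frequency vector of the current frame.
    centroids: (K, D) LSF cluster centroids.
    centroid_residuals: (K, L) prototype residuals, one per cluster."""
    d = np.linalg.norm(centroids - lsf, axis=1)
    w = np.exp(-beta * d)            # closer clusters receive larger weights
    w /= w.sum()
    return w @ centroid_residuals    # (L,) weighted combination
```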
ThC.P1b‑6
Speaker Adaptive Training for One-to-Many Eigenvoice Conversion Based on Gaussian Mixture Model
Yamato Ohtani, Nara Institute of Science and Technology
Tomoki Toda, Nara Institute of Science and Technology
Hiroshi Saruwatari, Nara Institute of Science and Technology
Kiyohiro Shikano, Nara Institute of Science and Technology
One-to-many eigenvoice conversion (EVC) allows the voice of a specific source speaker to be converted into those of arbitrary target speakers. An eigenvoice Gaussian mixture model (EV-GMM) is trained in advance on multiple parallel data sets consisting of the source speaker and many pre-stored target speakers. The EV-GMM is adapted to an arbitrary target speaker using only a few utterances by estimating a small number of free parameters. Therefore, the initial EV-GMM directly affects the conversion performance of the adapted EV-GMM. In order to prepare a better initial model, this paper proposes Speaker Adaptive Training (SAT) of a canonical EV-GMM in one-to-many EVC. Results of objective and subjective evaluations demonstrate that SAT yields significant improvements in the performance of EVC.
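The "small number of free parameters" in EVC are the weights on the eigenvoice basis vectors that shift the mixture means toward a new target speaker. The sketch below estimates such weights by a responsibility-weighted least-squares fit, ignoring the mixture covariances for simplicity; that simplification and the parameter layout are assumptions, not the paper's EV-GMM adaptation.

```python
import numpy as np

def adapt_target_means(bias_means, eigenvectors, adaptation_frames, responsibilities):
    """Estimate eigenvoice weights for a new target speaker (illustrative sketch).
    bias_means: (M, D) mixture-wise bias mean vectors of the canonical model.
    eigenvectors: (J, M, D) eigenvoice basis; J is the small number of free weights.
    adaptation_frames: (T, D) target-speaker feature vectors.
    responsibilities: (T, M) mixture posteriors for those frames.
    Returns the adapted (M, D) target mean vectors."""
    M = responsibilities.shape[1]
    J = eigenvectors.shape[0]
    # Responsibility-weighted least-squares fit of the J weights.
    A = np.zeros((J, J))
    b = np.zeros(J)
    for m in range(M):
        gamma = responsibilities[:, m]               # (T,)
        resid = adaptation_frames - bias_means[m]    # (T, D)
        E = eigenvectors[:, m, :]                    # (J, D)
        A += gamma.sum() * (E @ E.T)
        b += E @ (resid * gamma[:, None]).sum(axis=0)
    w = np.linalg.solve(A, b)                        # (J,) speaker weights
    return bias_means + np.tensordot(w, eigenvectors, axes=1)
```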
ThC.P1b‑7
Improving the Phase Vocoder Approach to Pitch Shifting
Petko N Petkov, Global IP Solutions, Sweden
Bastiaan W. Kleijn, Royal Institute of Technology, Sweden
A class of methods known as phase vocoders allows for implementing pitch shifting in the spectral domain. We extend the approach of shifting the isolated harmonics of the spectrum by introducing a new technique for separating the sinusoidal components. Keeping together the main lobe and the side lobes, which result from convolution of the harmonics with the spectrum of the analysis window in the Fourier transform, we minimize the leakage of energy and the related phase compensation problems. Furthermore, we integrate a robust enhancement to the update of the phase, based on tracking of the energy envelope. The formant structure of the signal is preserved by means of an all-pole speech production model. The proposed modifications lead to significant improvement of the quality of the pitch-shifted speech.
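The key idea of keeping a harmonic's main lobe and side lobes together can be sketched as translating whole peak regions of an STFT frame, with region boundaries at magnitude minima. The code below illustrates only that step; the energy-envelope-based phase update and the all-pole formant preservation described in the abstract are not implemented, and the peak-picking details are assumptions.

```python
import numpy as np

def shift_peak_regions(spectrum, ratio):
    """Pitch-shift one STFT frame by moving each spectral peak together with
    its neighbouring bins (a rough stand-in for 'main lobe plus side lobes').
    Phase continuity across frames is not handled here."""
    mag = np.abs(spectrum)
    n = len(spectrum)
    # Candidate harmonic peaks: local maxima of the magnitude spectrum.
    peaks = [k for k in range(1, n - 1) if mag[k] >= mag[k - 1] and mag[k] > mag[k + 1]]
    if not peaks:
        return spectrum.copy()
    # Region boundaries: magnitude minima between consecutive peaks.
    bounds = [0]
    for a, b in zip(peaks[:-1], peaks[1:]):
        bounds.append(a + int(np.argmin(mag[a:b])))
    bounds.append(n)
    out = np.zeros_like(spectrum)
    for p, lo, hi in zip(peaks, bounds[:-1], bounds[1:]):
        shift = int(round(p * (ratio - 1.0)))  # translate the whole region with its peak
        src = np.arange(lo, hi)
        dst = src + shift
        ok = (dst >= 0) & (dst < n)
        np.add.at(out, dst[ok], spectrum[src[ok]])
    return out
```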
ThC.P1b‑8
Comparing GMM-based speech transformation systems
Larbi Mesbahi, IRISA / University of Rennes 1 - ENSSAT
Vincent Barreaud, IRISA / University of Rennes 1 - ENSSAT
Olivier Boeffard, IRISA / University of Rennes 1 - ENSSAT
This article presents a study of GMM-based voice conversion systems. We compare the main linear conversion functions found in the literature on an identical speech corpus, paying particular attention to the risks of over-fitting and over-smoothing. We propose three alternative, more robust conversion functions in order to minimize these risks. We show, on two experimental speech databases, that the approach suggested by Kain remains the most precise but leads to an over-fitting ratio of 1.72%. The alternatives we propose show an average degradation of 2.8% for an over-fitting ratio of 0.52%.
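As a reference point for the functions being compared, the sketch below implements the classical joint-density GMM regression (the Stylianou-style mapping also used by Kain): each source vector is converted by a posterior-weighted sum of mixture-wise linear regressions. The parameter layout is an assumed, generic one, not the specific variants evaluated in the paper.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, sigma_xx, sigma_yx):
    """Convert one source spectral vector x with a joint-density GMM.
    weights: (M,) mixture weights; mu_x, mu_y: (M, D) means;
    sigma_xx, sigma_yx: (M, D, D) covariance blocks of the joint model."""
    M, D = mu_x.shape
    # Posterior probability of each mixture given x (full-covariance Gaussians;
    # the constant term cancels in the normalization below).
    log_post = np.empty(M)
    for m in range(M):
        diff = x - mu_x[m]
        inv = np.linalg.inv(sigma_xx[m])
        _, logdet = np.linalg.slogdet(sigma_xx[m])
        log_post[m] = np.log(weights[m]) - 0.5 * (logdet + diff @ inv @ diff)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Mixture-wise linear regression, blended by the posteriors.
    y = np.zeros(D)
    for m in range(M):
        y += post[m] * (mu_y[m] + sigma_yx[m] @ np.linalg.solve(sigma_xx[m], x - mu_x[m]))
    return y
```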