Interspeech 2007
August 27-31, 2007
Antwerp, Belgium

Interspeech 2007 Session TuC.SS: Synthesis of singing challenge

Type special
Date Tuesday, August 28, 2007
Time 13:30 – 15:30
Room Astrid Scala 1
Chair Gerrit Bloothooft (Utrecht University, The Netherlands)



Articulatory synthesis of singing
Peter Birkholz, Institute for Computer Science, University of Rostock

A system for the synthesis of singing on the basis of an articulatory speech synthesizer is presented. To enable the synthesis of singing, the speech synthesizer was extended in many respects. Most importantly, a rule-based transformation of a musical score into a gestural score for articulatory gestures was developed. Furthermore, a pitch-dependent articulation of vowels was implemented. The results of these extensions are demonstrated by the synthesis of a canon "Dona nobis pacem". The two voices in the canon were generated with the same underlying articulatory models and the same musical score, the only difference being that their pitches differ by one octave.
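The rules of the score-to-gestural-score transformation are not given in the abstract. As a rough illustration of what such a mapping involves, a minimal sketch (with an invented note format and invented tier names, not Birkholz's actual representation) might look like:

```python
from dataclasses import dataclass

@dataclass
class Gesture:
    tier: str       # articulatory tier, e.g. "vocalic" or "glottal_f0" (assumed names)
    value: str      # target on that tier
    start: float    # onset time in seconds
    end: float      # offset time in seconds

def score_to_gestures(notes, tempo_bpm=100):
    """Map a list of (pitch_hz, beats, vowel) notes to a flat gestural score.
    Both the note format and the two tiers are illustrative assumptions."""
    beat = 60.0 / tempo_bpm
    t = 0.0
    gestures = []
    for pitch_hz, beats, vowel in notes:
        dur = beats * beat
        # One vowel gesture and one F0 gesture per note, aligned in time.
        gestures.append(Gesture("vocalic", vowel, t, t + dur))
        gestures.append(Gesture("glottal_f0", f"{pitch_hz:.0f}Hz", t, t + dur))
        t += dur
    return gestures

# A hypothetical three-note fragment: (pitch in Hz, length in beats, vowel)
score = [(392, 1, "o"), (440, 1, "a"), (392, 2, "e")]
gestural_score = score_to_gestures(score)
```

A real system would add consonantal gestures, coarticulation overlap, and pitch-dependent vowel targets; the sketch only shows the score-to-timeline step.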

Vocal conversion from speaking voice to singing voice using STRAIGHT
Takeshi Saitou, National Institute of Advanced Industrial Science and Technology
Masataka Goto, National Institute of Advanced Industrial Science and Technology
Masashi Unoki, School of Information Science, Japan Advanced Institute of Science and Technology
Masato Akagi, School of Information Science, Japan Advanced Institute of Science and Technology

A vocal conversion system that can synthesize a singing voice given a speaking voice and a musical score is proposed. It is based on the speech manipulation system STRAIGHT and comprises three models controlling three acoustic features unique to singing voices: the F0, duration, and spectral envelope. Given the musical score and its tempo, the F0 control model generates the F0 contour of the singing voice by controlling four F0 fluctuations: overshoot, vibrato, preparation, and fine fluctuation. The duration control model lengthens the duration of each phoneme in the speaking voice by considering the duration of its musical note. The spectral control model converts the spectral envelope of the speaking voice into that of the singing voice by controlling both the singing formant and the amplitude modulation of formants in synchronization with vibrato. Experimental results show that the proposed system can convert speaking voices into singing voices whose quality resembles that of actual singing voices.
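The abstract names the four F0 fluctuations but not their equations. A toy sketch of two of them, overshoot and vibrato, applied to a single note can give the flavour; all rates and depths below are assumed values, not the paper's:

```python
import math

def singing_f0_contour(note_hz, duration_s, fs=100,
                       vibrato_hz=5.5, vibrato_cents=80,
                       overshoot_cents=120, overshoot_decay=0.15):
    """Toy F0 contour for one note: onset overshoot plus steady vibrato.
    All parameter values are illustrative, not taken from the paper."""
    contour = []
    for i in range(int(duration_s * fs)):
        t = i / fs
        # Overshoot: pitch starts sharp and decays exponentially to the target.
        cents = overshoot_cents * math.exp(-t / overshoot_decay)
        # Vibrato: sinusoidal modulation around the target pitch.
        cents += vibrato_cents * math.sin(2 * math.pi * vibrato_hz * t)
        contour.append(note_hz * 2 ** (cents / 1200))
    return contour

f0_contour = singing_f0_contour(440.0, 1.0)
```

Preparation (a dip before a note change) and fine fluctuation (low-amplitude noise) would be additional cent-valued terms in the same sum.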

Speech to chant transformation with the phase vocoder
Axel Roebel, IRCAM Paris
Joshua Fineberg, Harvard University

The technique used for the composition is a semi-automatic system for speech-to-chant conversion. The transformation is performed using an implementation of shape-invariant signal modification in the phase vocoder and a recent envelope estimation technique known as True Envelope estimation. We first describe the compositional idea and give an overview of the preprocessing steps required to identify the parts of the speech signal that can carry the singing voice. We then describe the envelope processing used to continuously transform the original voice of the actor into different female singing voices.
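True Envelope estimation works by iterative cepstral smoothing of the log spectrum. The simplified sketch below keeps only the core max-and-smooth iteration, substituting a moving average for the cepstral low-pass lifter of the actual algorithm:

```python
def true_envelope(log_spectrum, iterations=50, width=3):
    """Iteratively smooth the log spectrum and take the pointwise maximum
    with the original, so the estimate settles onto the spectral peaks.
    A moving average stands in here for the cepstral smoothing of the
    real True Envelope algorithm; parameters are illustrative."""
    env = list(log_spectrum)
    for _ in range(iterations):
        smoothed = []
        for i in range(len(env)):
            lo = max(0, i - width)
            window = env[lo:i + width + 1]
            smoothed.append(sum(window) / len(window))
        # Never let the envelope drop below the observed spectrum.
        env = [max(s, o) for s, o in zip(smoothed, log_spectrum)]
    return env
```

The max step is what distinguishes this from plain smoothing: the envelope is pulled up to ride on the harmonic peaks instead of averaging through the valleys between them.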

VOCALOID – Commercial singing synthesizer based on sample concatenation
Hideki Kenmochi, Center for Advanced Sound Technologies, Yamaha
Hayato Ohshita, Center for Advanced Sound Technologies, Yamaha

The song submitted here to the “Synthesis of Singing Challenge” was synthesized with the latest version of the singing synthesizer “Vocaloid”, which is now commercially available. In this paper, we present an overview of Vocaloid, its product lineup, a description of each component, and the synthesis technique used in Vocaloid.

RAMCESS/HandSketch: A Multi-Representation Framework for Realtime and Expressive Singing Synthesis
Nicolas D’Alessandro, Faculte Polytechnique de Mons
Thierry Dutoit, Faculte Polytechnique de Mons

In this paper we describe the investigations that are part of the development of a new singing digital musical instrument adapted to real-time performance. They concern improvements to the low-level synthesis modules, the mapping strategies underlying the development of a coherent and expressive control space, and the construction of a concrete bi-manual controller.

Formant-based synthesis of singing
Sten Ternström, KTH Stockholm
Johan Sundberg, KTH Stockholm

Singing synthesis at KTH has its roots in the 1970s, when Sundberg and Gauffin modified the text-to-speech systems developed by Carlson and Granström. An analogue singing synthesiser called MUSSE was built by Larsson in 1977. It included vibrato and other song-specific features, and could be played with a piano keyboard and joystick, or be remote-controlled by a minicomputer running a rule system. In the 1990s, several digital implementations of MUSSE were made by Ternström. The synthesis model described here is a descendant of these, built with Aladdin, a commercial DSP tool that was another outcome of this work (Aladdin Interactive DSP, Hitech Development AB, Täby, Sweden).
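As a generic illustration of formant synthesis (not the MUSSE or Aladdin implementation), a source-filter sketch with an impulse-train source and cascaded second-order resonators might look like the following; the formant frequencies and bandwidths are assumed values for a sung /a/:

```python
import math

def resonator(signal, freq_hz, bw_hz, fs):
    """Second-order all-pole resonator (one formant), normalised to unity gain at DC."""
    r = math.exp(-math.pi * bw_hz / fs)
    b1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    b2 = -r * r
    a0 = 1.0 - b1 - b2
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a0 * x + b1 * y1 + b2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

fs = 16000
f0 = 220.0  # sung pitch in Hz
# Impulse train standing in for the glottal pulse source.
src = [1.0 if n % round(fs / f0) == 0 else 0.0 for n in range(fs // 4)]
# Cascade three resonators; frequencies/bandwidths are illustrative, not KTH's.
wave = src
for freq, bw in [(650, 80), (1100, 90), (2600, 120)]:
    wave = resonator(wave, freq, bw, fs)
```

A rule system like the one described above would then drive these formant frequencies, the pitch, and vibrato over time, which is where the musical behaviour of the synthesiser actually lives.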
