Interspeech 2007
August 27-31, 2007

Antwerp, Belgium

Interspeech 2007 Session TuC.O3: Phonetic segmentation and classification I

Type oral
Date Tuesday, August 28, 2007
Time 13:30 – 15:30
Room Marble
Chair Shigeki Sagayama (The University of Tokyo)

Fixed-Size Kernel Logistic Regression for Phoneme Classification
Peter Karsmakers, IIBT, K.H. Kempen (Associatie KULeuven), B-2440 Geel, Belgium
Kristiaan Pelckmans, ESAT-SCD/SISTA, K.U.Leuven, B-3001 Heverlee, Belgium
Johan Suykens, ESAT-SCD/SISTA, K.U.Leuven, B-3001 Heverlee, Belgium
Hugo Van hamme, ESAT-PSI/SPEECH, K.U.Leuven, B-3001 Heverlee, Belgium

Kernel logistic regression (KLR) is a popular non-linear classification technique. Unlike empirical risk minimization approaches such as those employed by Support Vector Machines (SVMs), KLR yields probabilistic outcomes, based on a maximum likelihood argument, which are particularly important in speech recognition. In contrast to other KLR implementations, we use a Nyström approximation to solve large-scale problems, with estimation in the primal space as is done in fixed-size Least Squares Support Vector Machines (LS-SVMs). In the speech experiments we investigate how a natural KLR extension to multi-class classification compares to binary KLR models coupled via a one-versus-one coding scheme. Moreover, a comparison to SVMs is made.
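The fixed-size idea can be illustrated with a minimal sketch (all names and parameter values here are hypothetical, not the authors' implementation): an RBF kernel is approximated with a Nyström feature map built from a small subsample of the data, and a binary logistic model is then fitted in the resulting primal feature space.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def nystroem_map(X, landmarks, gamma=1.0):
    """Finite-dimensional feature map approximating the RBF kernel."""
    K_mm = rbf(landmarks, landmarks, gamma)
    evals, evecs = np.linalg.eigh(K_mm)
    keep = evals > 1e-6                        # drop near-null directions
    M = evecs[:, keep] / np.sqrt(evals[keep])  # acts like K_mm^{-1/2}
    return rbf(X, landmarks, gamma) @ M

def fit_klr(Phi, y, lam=1e-2, lr=0.5, steps=500):
    """Regularized binary logistic regression in the primal feature space."""
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Phi @ w))
        w -= lr * (Phi.T @ (p - y) / len(y) + lam * w)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)  # XOR-like, not linearly separable
Phi = nystroem_map(X, X[:20])              # fixed-size subsample as landmarks
w = fit_klr(Phi, y)
prob = 1.0 / (1.0 + np.exp(-Phi @ w))      # probabilistic outputs, unlike SVM scores
acc = ((prob > 0.5) == y).mean()
```

Because estimation happens in a feature space whose size is fixed by the subsample, cost grows only linearly with the number of training points, which is what makes this formulation attractive for large speech corpora.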
A Multiple-Model Based Framework for Automatic Speech Segmentation
Seung Seop Park, School of Electrical Engineering and INMC
Jong Won Shin, School of Electrical Engineering and INMC
Jong Kyu Kim, School of Electrical Engineering and INMC
Nam Soo Kim, School of Electrical Engineering and INMC

We propose a new approach to automatic speech segmentation for corpus-based speech synthesis. Instead of using a single automatic segmentation machine (ASM), we utilize multiple independent ASMs to obtain the final segmentation results: given multiple independent time-marks from the various ASMs, we remove the biases of the time-marks and then compute the weighted sum of the bias-removed time-marks. The bias and weight parameters needed for the proposed method are estimated for each phonetic context through a training procedure in which manually segmented results are used as references. The bias parameters are obtained by averaging the corresponding errors. The weight parameters are simultaneously optimized through the gradient projection method to handle a set of constraints in the weight parameter space. A decision tree is employed to deal with unseen phonetic contexts. Experimental results show that the proposed method remarkably improves segmentation accuracy.
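The fusion step can be sketched in a few lines (the numbers below are invented for illustration; in the paper the bias and weight parameters are learned per phonetic context from manually segmented references):

```python
import numpy as np

def fuse(time_marks, biases, weights):
    """Bias-removed, weighted combination of boundary estimates
    from several segmentation machines."""
    corrected = np.asarray(time_marks) - np.asarray(biases)
    w = np.asarray(weights, dtype=float)
    # the constraints the gradient projection step enforces:
    assert np.all(w >= 0) and abs(w.sum() - 1.0) < 1e-9
    return float(corrected @ w)

# three hypothetical ASMs proposing one phone boundary (seconds)
marks   = [0.512, 0.495, 0.530]
biases  = [0.010, -0.004, 0.028]   # per-context biases from training data
weights = [0.5, 0.3, 0.2]
boundary = fuse(marks, biases, weights)   # ≈ 0.501 s
```

The point of the combination is that systematic, context-dependent errors of each ASM cancel after bias removal, so the weighted sum lands closer to the manual reference than any single machine.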
Semi-Supervised Learning of Speech Sounds
Aren Jansen, Dept of Computer Science, University of Chicago
Partha Niyogi, Dept of Computer Science, University of Chicago

Recently, there has been much interest in both semi-supervised and manifold learning algorithms, though their applicability has not been explored for all domains. This paper has two goals: (i) to demonstrate that semi-supervised approaches based solely on clustering are insufficient for phoneme classification, and (ii) to present a new manifold-based semi-supervised algorithm to remedy this shortcoming. The improved performance of our approach over cluster-based methods substantiates the practical relevance of a geometric perspective on speech sounds.
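To illustrate the graph-based flavour of such algorithms (this is generic label propagation on a nearest-neighbour graph, not the authors' algorithm), a few labelled points can spread their labels along the data manifold:

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetrized k-nearest-neighbour adjacency matrix."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros_like(d)
    for i in range(len(X)):
        W[i, np.argsort(d[i])[1:k + 1]] = 1.0  # skip self at position 0
    return np.maximum(W, W.T)

def propagate(W, f0, labelled, iters=200):
    """Iteratively average neighbour labels, clamping the known ones."""
    P = W / np.maximum(W.sum(1, keepdims=True), 1e-12)
    f = f0.astype(float).copy()
    for _ in range(iters):
        f = P @ f
        f[labelled] = f0[labelled]
    return f

rng = np.random.default_rng(1)
# two well-separated clusters standing in for two speech-sound classes
X = np.vstack([rng.normal([0, 0], 0.3, size=(50, 2)),
               rng.normal([3, 0], 0.3, size=(50, 2))])
y = np.zeros(100); y[:50] = 1.0
f0 = np.zeros(100)
f0[0], f0[50] = 1.0, -1.0           # one labelled example per class
f = propagate(knn_graph(X), f0, labelled=np.array([0, 50]))
acc = (((f > 0).astype(float)) == y).mean()
```

A pure clustering approach succeeds on a toy example like this; the paper's point is that phoneme classes violate the cluster assumption, which is why a manifold-regularized formulation is needed instead.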
Evaluation of Syllable Stress using Single Class Classifier
Abhinav Parate, IBM India Research Laboratory, New Delhi
Ashish Verma, IBM India Research Laboratory, New Delhi
Jayanta Basak, IBM India Research Laboratory, New Delhi

Evaluation of syllable stress in speech utterances is an important and challenging task in the area of speaker evaluation. In this paper, we propose a method to classify correct utterances of English words based on the evaluation of the lexical syllable stress pattern. Here we use only correctly stressed utterances of the words as training samples, since a statistically significant pool of incorrectly stressed utterances is difficult to obtain. The underlying assumption is that the correct utterances of a word form a compact cluster, or a collection of compact clusters (with speaker-dependent variations), in a suitably chosen multi-dimensional attribute space. We experimentally demonstrate the effectiveness of the proposed method on several English words and also compare it with standard classifiers trained on samples of both correct and incorrect utterances.
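The single-class idea can be sketched as follows (the features and threshold are hypothetical, not the paper's attribute space): model the correct utterances as a compact cluster and reject any test utterance that falls too far from it.

```python
import numpy as np

class OneClassGaussian:
    """Accepts points close to the training cluster; trained on
    correctly stressed utterances only."""

    def fit(self, X, quantile=0.95):
        self.mu = X.mean(0)
        self.var = X.var(0) + 1e-6
        # threshold chosen so ~95% of the training samples are accepted
        self.tau = np.quantile(self._dist(X), quantile)
        return self

    def _dist(self, X):
        return np.sqrt((((X - self.mu) ** 2) / self.var).sum(1))

    def predict(self, X):
        return self._dist(X) <= self.tau   # True = "correct stress"

rng = np.random.default_rng(2)
# invented 2-D stress features (e.g. relative duration / energy ratios)
correct = rng.normal([1.0, 0.2], 0.1, size=(100, 2))
clf = OneClassGaussian().fit(correct)
wrong = rng.normal([0.3, 0.8], 0.1, size=(20, 2))  # differently stressed
reject_rate = (~clf.predict(wrong)).mean()
```

The appeal of the design is exactly what the abstract states: no incorrectly stressed training data is needed, because the decision boundary is derived from the correct cluster alone.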
Distinctive Phonetic Feature (DPF) Based Phone Segmentation using Hybrid Neural Networks
Mohammad Nurul Huda, Graduate School of Engineering, Toyohashi University of Technology, Aichi, Japan
Ghulam Muhammad, Graduate School of Engineering, Toyohashi University of Technology, Aichi, Japan
Tsuneo Nitta, Graduate School of Engineering, Toyohashi University of Technology, Aichi, Japan
Junsei Horikawa, Graduate School of Engineering, Toyohashi University of Technology, Aichi, Japan

Segmentation of speech into its corresponding phones has become a very important issue in many speech processing areas, such as speech recognition, speech analysis, speech synthesis, and speech database construction. In this paper, for accurate segmentation in speech recognition applications, we introduce Distinctive Phonetic Feature (DPF) based feature extraction using a two-stage neural network (NN) system consisting of a recurrent neural network (RNN) in the first stage and a multi-layer neural network (MLN) in the second stage. The RNN maps continuous acoustic features, Local Features (LFs), onto discrete DPF patterns, while the MLN constrains DPF context, or dynamics, in an utterance. The proposed DPF-based feature extractor provides good segmentation and a high recognition rate with a reduced mixture set of Hidden Markov Models (HMMs) by resolving the co-articulation effect.
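A structural sketch of such a two-stage pipeline is given below; the weights are random, so this shows only the data flow (acoustic frames → frame-wise DPF activations → context-smoothed DPF activations), not a trained system, and all dimensions are invented.

```python
import numpy as np

def rnn_stage(X, Wx, Wh, Wo):
    """Elman-style RNN: acoustic frames -> DPF activations per frame."""
    h = np.zeros(Wh.shape[0])
    out = []
    for x in X:                                   # one frame at a time
        h = np.tanh(Wx @ x + Wh @ h)              # recurrent hidden state
        out.append(1.0 / (1.0 + np.exp(-(Wo @ h))))
    return np.array(out)

def mln_stage(D, W1, W2, ctx=1):
    """Feed-forward net refining each frame from a window of
    neighbouring DPF frames (modelling DPF context/dynamics)."""
    T, F = D.shape
    padded = np.vstack([np.zeros((ctx, F)), D, np.zeros((ctx, F))])
    out = []
    for t in range(T):
        z = padded[t:t + 2 * ctx + 1].ravel()     # window of 2*ctx+1 frames
        out.append(1.0 / (1.0 + np.exp(-(W2 @ np.tanh(W1 @ z)))))
    return np.array(out)

rng = np.random.default_rng(3)
T, n_acoustic, n_hidden, n_dpf = 20, 12, 8, 15    # invented sizes
X = rng.normal(size=(T, n_acoustic))              # stand-in acoustic features
D = rnn_stage(X, rng.normal(size=(n_hidden, n_acoustic)) * 0.1,
                 rng.normal(size=(n_hidden, n_hidden)) * 0.1,
                 rng.normal(size=(n_dpf, n_hidden)))
Dp = mln_stage(D, rng.normal(size=(16, 3 * n_dpf)) * 0.1,
                  rng.normal(size=(n_dpf, 16)))   # same (T, n_dpf) shape
```

The second stage is what distinguishes the design: by looking at a window of DPF frames rather than a single frame, it can suppress DPF trajectories that are implausible across phone transitions, which is where co-articulation errors arise.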
A methodology for the automatic detection of perceived prominent syllables in spoken French
Jean-Philippe Goldman, University of Geneva, Switzerland
Mathieu Avanzi, University of Neuchâtel, Switzerland
Anne-Catherine Simon, University of Louvain-la-Neuve, Belgium
Anne Lacheret, University of Paris X, France
Antoine Auchlin, University of Geneva, Switzerland

Prosodic transcription of spoken corpora relies mainly on the identification of perceived prominence. However, the manual annotation of prominence phenomena is extremely time-consuming and varies greatly from one expert to another, so automating this procedure would be of great value. In this study, we present the first results of a methodology aiming at the automatic detection of prominent syllables. It is based on (1) a spontaneous French corpus that has been manually annotated according to a strict methodology, and (2) a set of acoustic prosodic parameters, shown to be corpus-independent, that are used to detect prominent syllables. Some automatic tools used to handle large corpora are also described.
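A toy version of such a detector (the features and thresholds below are invented, not the calibrated corpus-independent parameters of the study) flags a syllable as prominent when its pitch or duration clearly exceeds that of its neighbours:

```python
def prominent(f0, dur, f0_ratio=1.2, dur_ratio=1.4):
    """Flag syllables whose mean F0 or duration stands out
    relative to the adjacent syllables."""
    flags = []
    for i in range(len(f0)):
        ctx = [j for j in (i - 1, i + 1) if 0 <= j < len(f0)]
        f0_ref = sum(f0[j] for j in ctx) / len(ctx)
        dur_ref = sum(dur[j] for j in ctx) / len(ctx)
        flags.append(f0[i] > f0_ratio * f0_ref or
                     dur[i] > dur_ratio * dur_ref)
    return flags

f0  = [180, 185, 240, 182, 178]   # mean F0 per syllable (Hz), invented
dur = [90, 95, 160, 92, 88]       # syllable duration (ms), invented
print(prominent(f0, dur))         # [False, False, True, False, False]
```

Using relative rather than absolute thresholds is what makes this style of cue usable across speakers and corpora, which is the property the study emphasizes for its parameters.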
