Interspeech 2007
August 27-31, 2007

Antwerp, Belgium

Interspeech 2007 Session WeB.P2: Speaker verification & identification II

Type: poster
Date: Wednesday, August 29, 2007
Time: 10:00 – 12:00
Room: Alpaerts
Chair: Joseph P. Campbell (MIT Lincoln Laboratory)


Application of Shifted Delta Cepstral Features in Speaker Verification
Jose R. Calvo, Advanced Technologies Application Center, CENATAV, Cuba
Rafael Fernández, Advanced Technologies Application Center, CENATAV, Cuba
Gabriel Hernández, Advanced Technologies Application Center, CENATAV, Cuba

Recently, the Shifted Delta Cepstral (SDC) feature was reported to outperform the delta and delta-delta features in cepstral-feature-based language identification (LID) systems. This paper examines the application of SDC features to speaker verification and evaluates their robustness to channel mismatch, manner of speaking, and session variability. The experimental results show that SDC features perform better than, or at least similarly to, delta and delta-delta features in speaker verification.
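For readers unfamiliar with SDC features, the sketch below shows the usual construction: for each frame, k delta vectors taken at shifts of p frames are stacked into one feature vector. The function name, the default parameters (d=1, p=3, k=7, a configuration often quoted for LID) and the edge handling are assumptions made for illustration; the abstract does not state the settings used in the paper.

```python
def shifted_delta_cepstra(frames, d=1, p=3, k=7):
    """Stack k shifted delta blocks per frame (illustrative sketch).

    frames: list of per-frame cepstral vectors (lists of floats).
    Block i of frame t is c[t + i*p + d] - c[t + i*p - d].
    Indices outside the utterance are clamped to the nearest edge
    frame; this edge handling is an assumption of the sketch.
    """
    last = len(frames) - 1

    def frame_at(index):
        # Clamp the index so edge frames are repeated.
        return frames[min(max(index, 0), last)]

    features = []
    for t in range(len(frames)):
        stacked = []
        for i in range(k):
            ahead = frame_at(t + i * p + d)
            behind = frame_at(t + i * p - d)
            stacked.extend(a - b for a, b in zip(ahead, behind))
        features.append(stacked)
    return features


# Toy input: 20 frames with a single coefficient that rises by 1 per frame.
ramp = [[float(t)] for t in range(20)]
sdc = shifted_delta_cepstra(ramp)
```

With N cepstral coefficients per frame, each output vector has N*k values, so 7 coefficients and k=7 would give 49-dimensional features.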

A Smoothing Kernel for Spatially Related Features and Its Application to Speaker Verification
Luciana Ferrer, Stanford University
Kemal Sonmez, SRI International
Elizabeth Shriberg, SRI International

Most commonly used kernels are invariant to permutations of the feature vector components. This characteristic may make machine learning methods that use such kernels suboptimal in cases where the feature vector has an underlying structure. In this paper we will consider one such case, where the features are spatially related. We show a way to modify the objective function of the support vector machine (SVM) optimization problem to account for this structure. The new optimization problem can be implemented as a standard SVM using a particular smoothing kernel. Results are shown on a speaker verification task using prosodic features that are transformed using a particular implementation of the Fisher score. The proposed method leads to improvements of as much as 15% in equal error rate (EER).

VZ-Norm: an Extension of Z-norm to the Multivariate Case for Anchor Model based Speaker Verification
Delphine Charlet, France Telecom R&D
Mikael Collet, France Telecom R&D
Frédéric Bimbot, IRISA

This paper proposes a vectorial Z-normalization approach, the VZ-norm, which extends Z-normalization to the multivariate case. It is applied in the framework of Anchor Model (AM) based speaker verification. Experiments show that it significantly improves the performance of anchor models on the NIST and ESTER databases. A comparative study of different strategies for computing the covariance matrix involved in the AM VZ-norm is presented and discussed.
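As background, the sketch below shows standard scalar Z-normalization, followed by one simple way it could be carried over to score vectors. The per-dimension (diagonal covariance) version is only an assumption chosen to keep the example short; it is not claimed to be the paper's VZ-norm, and the abstract notes that several covariance strategies are compared. All function names are hypothetical.

```python
from statistics import mean, pstdev


def z_norm(score, impostor_scores):
    """Scalar Z-norm: centre and scale a score using impostor statistics."""
    return (score - mean(impostor_scores)) / pstdev(impostor_scores)


def vector_z_norm_diagonal(vector, impostor_vectors):
    """Apply Z-norm to each dimension independently.

    Assumption of this sketch: the covariance matrix is diagonal,
    so every dimension is normalized with its own mean and deviation.
    """
    columns = list(zip(*impostor_vectors))
    return [z_norm(value, list(column))
            for value, column in zip(vector, columns)]


# Toy impostor data: three score vectors of dimension two.
impostors = [[1.0, 0.0], [2.0, 10.0], [3.0, 20.0]]
normalized = vector_z_norm_diagonal([3.0, 10.0], impostors)
```

A full-covariance variant would replace the per-dimension scaling with a whitening transform estimated from the impostor vectors.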

Word-Conditioned HMM Supervectors for Speaker Recognition
Howard Lei, University of California - Berkeley, and International Computer Science Institute
Nikki Mirghafori, International Computer Science Institute

We improve upon the current Hidden Markov Model (HMM) techniques for speaker recognition by using the means of Gaussian mixture components of keyword HMM states in a support vector machine (SVM) classifier. We achieve an 11% improvement over the traditional keyword HMM approach on SRE06 for the 8 conversation task, using the original set of keywords. Using an expanded set of keywords, we achieve a 4.3% EER standalone on SRE06, and a 2.6% EER in combination with a word-conditioned phone N-grams system, a GMM-based system, and the traditional keyword HMM system on SRE05+06. The latter result improves on our previous best.

Speaker Clustering Using Direct Maximization of A BIC-based Score
Wei-Ho Tsai, Department of Electronic Engineering & Graduate Institute of Computer and Communication Engineering, National Taipei University of Technology, Taipei, Taiwan

This paper presents an effective method for clustering unknown speech utterances based on their associated speakers. The proposed method jointly optimizes the generated clusters and the required number of clusters according to a Bayesian information criterion (BIC). The criterion assesses a partitioning of utterances by how much within-cluster homogeneity it achieves at the expense of increasing the number of clusters. Unlike existing methods, in which BIC is used only to determine the optimal number of clusters, the proposed method uses BIC in conjunction with a genetic algorithm to determine the optimal cluster to which each utterance should be assigned. The experimental results show that the proposed speaker-clustering method outperforms the conventional methods.

Confidence measure based unsupervised target model adaptation for speaker verification
Alexandre Preti, Laboratoire d'Informatique d'Avignon / THALES COMMUNICATIONS
Jean-François Bonastre, Laboratoire d'Informatique d'Avignon
Driss Matrouf, Laboratoire d'Informatique d'Avignon

This paper proposes a new method for updating the client models of a speaker recognition system online. The main idea of the proposed approach is to adapt the client model using all the information gathered from successive test data, without deciding whether a test utterance belongs to the client or to an impostor. The adaptation process includes a weighting scheme for the test data, based on the a posteriori probability that a test utterance belongs to the targeted client model. The proposed approach is evaluated within the framework of the NIST 2005 and 2006 Speaker Recognition Evaluations. The links between the adaptation method and channel mismatch factors are also explored, using both Feature Mapping and Latent Factor Analysis (LFA) methods. The proposed unsupervised adaptation outperforms the baseline system, with a DCF relative improvement of 27% (37% for EER). When the LFA channel compensation technique is used, the proposed approach achieves a DCF reduction of 20% (12.5% for EER).

Emotion Attribute Projection for Speaker Recognition on Emotional Speech
Huanjun Bao, Center for Speech Technology, Information Science and Technology, Department of Computer Science and Technology, Tsinghua University
Mingxing Xu, Center for Speech Technology, Information Science and Technology, Department of Computer Science and Technology, Tsinghua University
Fang Zheng, Center for Speech Technology, Information Science and Technology, Department of Computer Science and Technology, Tsinghua University

Emotion is one of the important factors that degrade system performance. By analyzing the similarity between the channel effect and the emotion effect on speaker recognition, an emotion compensation method called emotion attribute projection (EAP) is proposed to alleviate intra-speaker emotion variability. This method achieves a relative equal error rate (EER) reduction of 11.7%, with the EER reduced from 9.81% to 8.66%. When a linear fusion of a GMM-UBM system with an EER of 9.38% and an SVM-EAP system with an EER of 8.66% is adopted, further relative EER reductions of 22.5% and 16.1%, respectively, are achieved, giving a final EER of 7.27%.
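The percentages in this abstract are consistent with relative reductions computed from the quoted EER values, as the short check below illustrates. The helper name is hypothetical; only the EER figures come from the abstract.

```python
def relative_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# EER values quoted in the abstract (in percent).
eap_gain = round(relative_reduction(9.81, 8.66), 1)           # 11.7
fusion_vs_gmm_ubm = round(relative_reduction(9.38, 7.27), 1)  # 22.5
fusion_vs_svm_eap = round(relative_reduction(8.66, 7.27), 1)  # 16.1
```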

High-Level Feature-Based Speaker Verification via Articulatory Phonetic-Class Pronunciation Modeling
Shi-Xiong Zhang, The Hong Kong Polytechnic University
Man-Wai Mak, The Hong Kong Polytechnic University
Helen Meng, The Chinese University of Hong Kong

Although articulatory feature-based conditional pronunciation models (AFCPMs) can capture the pronunciation characteristics of speakers, they require one discrete density function for each phoneme, which may lead to inaccurate models when the amount of training data is limited. This paper proposes a phonetic-class based AFCPM in which the density functions in speaker models are conditioned on phonetic classes instead of phonemes. Phonemes are mapped to phonetic classes by (1) vector quantizing the phoneme-dependent universal background models, (2) grouping phonemes according to the classical phoneme tree, and (3) a combination of (1) and (2). A new scoring method that uses an SVM to combine the scores of phonetic-class models is also proposed. Evaluations based on the 2000 NIST SRE show that the proposed approach can effectively solve the data sparseness problem encountered in conventional AFCPM.

Direct Acoustic Feature Using Iterative EM Algorithm and Spectral Energy for Classifying Suicidal Speech
Thaweesak Yingthawornsuk, Department of Electrical Engineering and Computer Science, Vanderbilt University, TN, USA, and Department of Electrical Technology Education, KMUTT, Bangkok, Thailand
Hande Kaymaz Keskinpala, Department of Electrical Engineering and Computer Science, Vanderbilt University, TN, USA
Don Mitchell Wilkes, Department of Electrical Engineering and Computer Science, Vanderbilt University, TN, USA
Richard G Shiavi, Department of Electrical Engineering and Computer Science, and Department of Biomedical Engineering, Vanderbilt University, TN, USA
Ronald M Salomon, Department of Psychiatry, Vanderbilt University School of Medicine, TN, USA

Research has shown that the voice itself contains important information about a speaker's immediate psychological state, and that certain vocal parameters can distinguish the speaking patterns of speech affected by emotional disturbances (i.e., clinical depression). In this study, GMM-based features of the vocal tract system response and spectral energy were studied and found to be a primary acoustic feature set for separating two groups of female patients carrying diagnoses of depression and suicidal risk.

On comparing and combining intra-speaker variability compensation and unsupervised model adaptation in speaker verification
Claudio Garreton, Universidad de Chile
Nestor Becerra Yoma, Universidad de Chile
Fernando Huenupan, Universidad de Chile
Carlos Molina, Universidad de Chile

In this paper, an unsupervised intra-speaker variability compensation method, ISVC, and unsupervised model adaptation are tested to address the problem of limited enrollment data in text-dependent speaker verification. In contrast to model adaptation methods, ISVC is memoryless with respect to previous verification attempts. As shown here, unsupervised model adaptation can lead to substantial improvements in EER but is highly dependent on the sequence of client/impostor verification events. In adverse scenarios, unsupervised model adaptation might even reduce verification accuracy when compared with the baseline system. In those cases, ISVC may outperform adaptation schemes. It is worth emphasizing that ISVC and unsupervised model adaptation are compatible and that the combination of both methods always improves the performance of model adaptation. The combination of both schemes can lead to improvements in EER as high as 34%.

Comparison of Two Kinds of Speaker Location Representation for SVM-based Speaker Verification
Xianyu Zhao, France Telecom R&D Center (Beijing)
Yuan Dong, Beijing University of Posts and Telecommunications
Hao Yang, Beijing University of Posts and Telecommunications
Jian Zhao, Beijing University of Posts and Telecommunications
Liang Lu, Beijing University of Posts and Telecommunications
Haila Wang, France Telecom R&D Center (Beijing)

In anchor modeling, each speaker utterance is represented as a fixed-length location vector in the space of reference speakers by scoring against a set of anchor models. SVM-based speaker verification systems using the anchor location representation have been studied in previously reported work with promising results. In this paper, linear combination weights in reference speaker weighting (RSW) adaptation are explored as an alternative kind of speaker location representation. This RSW location representation is compared with the anchor location representation in various speaker verification tasks on the 2006 NIST Speaker Recognition Evaluation corpus. Experimental results indicate that, with utterances long enough for reliable maximum likelihood estimation in RSW, the RSW location representation leads to better speaker verification performance than the anchor location representation, while the latter is more effective for verification of short utterances in a high-dimensional representation space.

Jitter and Shimmer Measurements for Speaker Recognition
Mireia Farrús, Universitat Politècnica de Catalunya
Javier Hernando, Universitat Politècnica de Catalunya
Pascual Ejarque, Universitat Politècnica de Catalunya

Jitter and shimmer are measures of the cycle-to-cycle variations of fundamental frequency and amplitude, respectively, which have been widely used for the description of pathological voice quality. Since they characterise some aspects of particular voices, differences in the values of jitter and shimmer among speakers are expected a priori. In this paper, several types of jitter and shimmer measurements have been analysed. Experiments performed with the Switchboard-I conversational speech database show that jitter and shimmer measurements give excellent results in speaker verification as complementary features of spectral and prosodic parameters.
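To make the cycle-to-cycle idea concrete, the sketch below computes one common relative form of each measure: the mean absolute difference between consecutive pitch periods (or peak amplitudes) divided by the mean value. The abstract mentions several measurement types without listing them, so these particular definitions and the function names are assumptions for illustration.

```python
def mean_abs_consecutive_diff(values):
    """Average absolute difference between neighbouring cycles."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)


def relative_jitter(periods):
    """Cycle-to-cycle period variation relative to the mean period."""
    return mean_abs_consecutive_diff(periods) / (sum(periods) / len(periods))


def relative_shimmer(amplitudes):
    """Cycle-to-cycle amplitude variation relative to the mean amplitude."""
    return mean_abs_consecutive_diff(amplitudes) / (sum(amplitudes) / len(amplitudes))


# Toy pitch periods (ms) alternating between 10 and 12.
jitter = relative_jitter([10.0, 12.0, 10.0, 12.0])
# Constant peak amplitudes give zero shimmer.
shimmer = relative_shimmer([1.0, 1.0, 1.0])
```

Both functions need at least two cycles; in practice the periods and amplitudes would come from a pitch-marking step that is outside the scope of this sketch.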

Natural-Emotion GMM Transformation Algorithm for Emotional Speaker Recognition
Zhenyu Shan, Zhejiang University
Yingchun Yang, Zhejiang University
Ruizhi Ye, Zhejiang University

One of the largest challenges in speaker recognition is dealing with speaker-emotion variability. Compensation techniques are currently the main solutions to this problem. These methods require eliciting all kinds of emotional speech from speakers, which is not user-friendly in applications. The basic problem is therefore how to obtain the distribution of speakers' emotional speech and how to train an emotion GMM from their natural speech. This paper presents a natural-emotion GMM transformation algorithm that trains users' emotion models to overcome this problem. The algorithm converts a natural GMM to an emotion GMM based on an emotion database. It needs only speakers' natural speech and does not need to align the natural utterances with the emotion utterances. Performance is evaluated on the MASC database, and promising results are achieved compared to traditional speaker verification.

Optimized One-Bit Quantization for Adapted GMM-Based Speaker Verification
Ivy H. Tseng, University of Southern California
Olivier Verscheure, IBM T.J. Watson Research Center
Deepak S. Turaga, IBM T.J. Watson Research Center
Upendra V. Chaudhari, IBM T.J. Watson Research Center

We tackle the problem of designing the optimized one-bit quantizer for speech cepstral features (MFCCs) in speaker verification systems that use the likelihood ratio test, with Gaussian Mixture Models for likelihood functions, and a Universal Background Model (UBM) with Bayesian adaptation used to derive individual speaker models from the UBM. Unlike prior work that designed a Minimum Log-Likelihood Ratio Difference (MLLRD) quantizer, we design a new quantizer that explicitly optimizes the desired tradeoff between the probabilities of false alarm and detection, directly in probability space. We analytically derive the optimal reconstruction levels for a one-bit quantizer, given a classification decision threshold, and evaluate its performance for speaker verification on the Switchboard corpus. The designed quantizer shows minimal impact on equal error rate (with an achieved compression ratio of 32) as compared to the original system, and significantly outperforms the MLLRD strategy.

A Comparison of Session Variability Compensation Techniques for SVM-based Speaker Recognition
Mitchell Leigh McLaren, Queensland University of Technology (QUT), Brisbane, Australia
Robbie Vogt, Queensland University of Technology (QUT), Brisbane, Australia
Brendan Baker, Queensland University of Technology (QUT), Brisbane, Australia
Sridha Sridharan, Queensland University of Technology (QUT), Brisbane, Australia

This paper compares two of the leading techniques for session variability compensation in the context of GMM mean supervector SVM classifiers for speaker recognition: inter-session variability modelling and nuisance attribute projection. The former is incorporated in the GMM model training while the latter is employed as a modified SVM kernel. Results on both the NIST 2005 and 2006 corpora demonstrate the effectiveness of both techniques for reducing the effects of session variation. Further, system- and score-level fusion experiments show that the combination of the two methods provides improved performance.

Influence of task duration in text-independent speaker verification
Benoit Fauve, Speech and Image Research Group, University of Wales Swansea, UK
Nicholas Evans, LIA, Universite d’Avignon et des Pays de Vaucluse, France
Neil Pearson, Speech and Image Research Group, University of Wales Swansea, UK
Jean-François Bonastre, LIA, Universite d’Avignon et des Pays de Vaucluse, France
John Mason, Speech and Image Research Group, University of Wales Swansea, UK

Short-duration tasks for text-independent speaker verification have received relatively little attention compared with tasks involving many minutes of speech. In this paper we investigate verification performance over a range of durations from a few seconds to a few minutes. We begin with a state-of-the-art GMM-based system operating on a few minutes of speech per person and show that the same system is suboptimal on short (10-second) speech recordings. In particular, we highlight that optimal frame selection depends on overall duration. This work sheds some light on the difficulties of transposing recent and important techniques such as SVM-NAP to short-duration tasks.
