Interspeech 2007 Session TuD.O1: Speaker verification & identification I
Type
oral
Date
Tuesday, August 28, 2007
Time
16:00 – 18:00
Room
Elisabeth
Chair
Sadaoki Furui (Department of Computer Science, Tokyo Institute of Technology)
TuD.O1‑1
16:00
A New Kernel for SVM MLLR based Speaker Recognition
Zahi Karam, MIT Lincoln Laboratory and MIT Digital Signal Processing Group
William Campbell, MIT Lincoln Laboratory
Speaker recognition using support vector machines (SVMs) with features derived from generative models has been shown to perform well. Typically, a universal background model (UBM) is adapted to each utterance, yielding a set of features that are used in an SVM. We consider the case where the UBM is a Gaussian mixture model (GMM) and maximum likelihood linear regression (MLLR) is used to adapt the means of the UBM. We examine two possible SVM feature expansions that arise in this context: in the first, a GMM supervector is constructed by stacking the means of the adapted GMM; the second consists of the elements of the MLLR transform. We examine several kernels associated with these expansions and show that both expansions are equivalent given the proper choice of kernels. Experiments performed on the NIST SRE 2006 corpus highlight that our choice of kernels, motivated by distance metrics between GMMs, outperforms ad hoc ones. We also apply nuisance attribute projection (NAP) to the kernels for channel compensation and show that, with a proper choice of kernel, we achieve results comparable to existing SVM-based recognizers.
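The supervector kernel discussed in this abstract can be sketched in a few lines: a linear kernel between stacked adapted means, with each mean scaled by its mixture weight and inverse covariance, which is the kind of GMM-distance-motivated choice the authors contrast with ad hoc kernels. A minimal NumPy sketch, assuming diagonal covariances and hypothetical array shapes:

```python
import numpy as np

def supervector_kernel(means_a, means_b, weights, covars):
    # KL-divergence-motivated linear kernel between two mean-adapted GMMs:
    # each mean vector is scaled by sqrt(w_i) / sqrt(sigma_i) before the
    # dot product, so mixture weight and variance set each component's weight.
    # means_*: (n_components, dim) adapted means; weights: (n_components,);
    # covars: (n_components, dim) diagonal covariances (hypothetical shapes).
    scale = np.sqrt(weights)[:, None] / np.sqrt(covars)
    sv_a = (scale * means_a).ravel()   # stacked "supervector" for utterance a
    sv_b = (scale * means_b).ravel()
    return float(sv_a @ sv_b)
```

With this scaling, the kernel value approximates a divergence-based distance between the two adapted GMMs rather than a raw dot product of means.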
TuD.O1‑2
16:20
A GMM-based Probabilistic Sequence Kernel for Speaker Verification
Kong-Aik Lee, Institute for Infocomm Research, Singapore
Changhuai You, Institute for Infocomm Research, Singapore
Haizhou Li, Institute for Infocomm Research, Singapore
Tomi Kinnunen, Institute for Infocomm Research, Singapore
This paper describes the derivation of a sequence kernel that transforms speech utterances into probabilistic vectors for classification in an expanded feature space. The sequence kernel is built upon a set of Gaussian basis functions, where half of the basis functions carry speaker-specific information while the other half represent the common characteristics of the competing background speakers. The idea is similar to that of the Gaussian mixture model – universal background model (GMM-UBM) system, except that the Gaussian densities are treated individually in our proposed sequence kernel, as opposed to the two mixtures of Gaussian densities in the GMM-UBM system. The motivation is to exploit the individual Gaussian components for better speaker discrimination. Experiments on the NIST 2001 SRE corpus show convincing results for the probabilistic sequence kernel approach.
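The mapping from an utterance to a probabilistic vector can be illustrated by treating each Gaussian individually, as the abstract describes: every frame is expanded into posterior probabilities over the pooled basis functions, and the posteriors are averaged over time to give a fixed-length vector. A rough NumPy sketch with hypothetical shapes (not the paper's exact formulation):

```python
import numpy as np

def gauss_diag(x, mean, var):
    # Diagonal-covariance Gaussian density evaluated for every frame in x
    norm = np.prod(2.0 * np.pi * var) ** -0.5
    return norm * np.exp(-0.5 * np.sum((x - mean) ** 2 / var, axis=-1))

def probabilistic_vector(frames, means, vars_, weights):
    # frames: (T, d) utterance; means/vars_: (M, d) pooled speaker and
    # background Gaussians; weights: (M,). All shapes are illustrative.
    lik = np.stack([w * gauss_diag(frames, m, v)
                    for w, m, v in zip(weights, means, vars_)], axis=1)  # (T, M)
    post = lik / lik.sum(axis=1, keepdims=True)   # per-frame posteriors
    return post.mean(axis=0)                      # fixed-length utterance vector
```

Because the M Gaussians are scored individually rather than collapsed into two mixture likelihoods, each component contributes its own coordinate to the expanded feature space.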
TuD.O1‑3
16:40
Speaker Recognition using Kernel-PCA and Intersession Variability Modeling
Hagai Aronowitz, IBM T.J. Watson Research Center
This paper presents a new method for text independent speaker recognition. We embed both training and test sessions into a session space. The session space is a direct sum of a common-speaker subspace and a speaker-unique subspace. The common-speaker subspace is Euclidean and is spanned by a set of reference sessions. Kernel-PCA is used to explicitly embed sessions into the common-speaker subspace. The common-speaker subspace typically captures attributes that are common to many speakers. The speaker-unique subspace is the orthogonal complement of the common-speaker subspace and typically captures attributes that are speaker unique. We model intersession variability in the common-speaker subspace, and combine it with the information that exists in the speaker-unique subspace. Our suggested framework leads to a 43.5% reduction in error rate compared to a Gaussian Mixture Model (GMM) baseline.
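The explicit embedding step can be sketched with a plain kernel-PCA computed from a set of reference sessions: the test session's kernel values against the references are centred and projected onto the leading eigenvectors, giving its coordinates in the common-speaker subspace. The kernel choice, gamma, and dimensionality below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def rbf(A, B, gamma):
    # RBF kernel matrix between two sets of session vectors
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kpca_embed(refs, sessions, gamma=0.05, n_components=5):
    # Embed sessions into the subspace spanned (in feature space)
    # by the reference sessions, via kernel-PCA.
    n = refs.shape[0]
    K = rbf(refs, refs, gamma)
    J = np.full((n, n), 1.0 / n)
    Kc = K - J @ K - K @ J + J @ K @ J             # centre in feature space
    vals, vecs = np.linalg.eigh(Kc)                # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]    # keep the leading ones
    alphas = vecs[:, idx] / np.sqrt(np.maximum(vals[idx], 1e-12))
    Kt = rbf(sessions, refs, gamma)
    Jt = np.full((sessions.shape[0], n), 1.0 / n)
    Ktc = Kt - Jt @ K - Kt @ J + Jt @ K @ J        # centre the test kernel
    return Ktc @ alphas                            # common-subspace coordinates
```

The residual that this projection cannot represent corresponds to the speaker-unique subspace the abstract describes; intersession variability modeling is then done on the explicit coordinates.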
TuD.O1‑4
17:00
Linear and Non Linear Kernel GMM SuperVector Machines for Speaker Verification
Reda Dehak, LRDE-EPITA, France
Najim Dehak, CRIM, ETS, Canada
Patrick Kenny, CRIM, Canada
Pierre Dumouchel, CRIM, ETS, Canada
This paper presents a comparison between support vector machine (SVM) speaker verification systems based on linear and non-linear kernels defined in the GMM supervector space. We describe how these kernel functions are related and show how the nuisance attribute projection (NAP) technique can be used with both kernels to deal with the session variability problem. We demonstrate the importance of GMM model normalization (M-Norm), especially for the non-linear kernel. All our experiments were performed on the core condition of the NIST 2006 speaker recognition evaluation (all trials). Our best results (an equal error rate of 6.3%) were obtained using NAP and GMM model normalization with the non-linear kernel.
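NAP, used in several of this session's systems, is at its core an orthogonal projection that removes an estimated nuisance (session/channel) subspace from each supervector before the kernel is computed. A minimal sketch; in practice the basis U is estimated from multi-session training data rather than given:

```python
import numpy as np

def nap_project(X, U):
    # Nuisance attribute projection: x -> (I - U U^T) x
    # X: (n, dim) supervectors; U: (dim, k) orthonormal nuisance basis
    # (both shapes are illustrative assumptions).
    return X - (X @ U) @ U.T
```

Any component of a supervector lying in the span of U (the session-dependent directions) is removed, while the orthogonal, speaker-dependent part passes through unchanged.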
TuD.O1‑5
17:20
Support Vector Regression for Speaker Verification
Ignacio Lopez-Moreno, Universidad Autonoma de Madrid
Ismael Mateos-Garcia, Universidad Autonoma de Madrid
Daniel Ramos, Universidad Autonoma de Madrid
Joaquin Gonzalez-Rodriguez, Universidad Autonoma de Madrid
This paper explores Support Vector Regression (SVR) as an alternative to the widely used Support Vector Classification (SVC) in GLDS-based speaker verification. SVR allows the use of an epsilon-insensitive loss function, which presents several advantages. First, optimizing the epsilon parameter adapts the system to the variability of the features extracted from the speech. Second, the approach is robust to outliers when training the speaker models. Finally, SVR training is related to the optimization of the probability of the speaker model given the data. Results are presented using the NIST SRE 2006 protocol, showing that SVR-GLDS yields a relative improvement of 31% in EER compared to SVC-GLDS.
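The epsilon-insensitive loss at the heart of SVR can be stated in one line: residuals inside the epsilon tube cost nothing, which is the source of the robustness to outliers and the tunability mentioned in the abstract. A minimal sketch:

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    # L_eps(r) = max(0, |r| - eps): residuals within the eps tube are free,
    # and larger errors grow only linearly, limiting the pull of outliers.
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)
```

Widening eps makes the model ignore more of the feature variability; shrinking it recovers a loss closer to plain absolute error, which is the knob the authors tune to the speech features.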
TuD.O1‑6
17:40
Derivative and Parametric Kernels for Speaker Verification
Chris Longworth, Cambridge University Engineering Department
Mark Gales, Cambridge University Engineering Department
The use of Support Vector Machines (SVMs) for speaker verification has become increasingly popular. To handle the dynamic nature of speech utterances, many SVM-based systems use dynamic kernels. These kernels largely fall into two classes: parametric kernels, where the feature space consists of the parameters of an utterance-dependent model, and derivative kernels, where the derivatives of the utterance log-likelihood with respect to the parameters of a generative model are used. This paper contrasts the attributes of these two forms of kernel and describes the conditions under which they are identical. Two forms of dynamic kernel are examined in detail, based on MLLR-adapted and mean MAP-adapted models. The performance of these kernels is evaluated on the NIST SRE 2002 dataset. Combining the two forms of kernel gave a 35% relative reduction in equal error rate compared to the best individual kernel.
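The derivative-kernel feature space described in this abstract consists of log-likelihood gradients; for the means of a diagonal-covariance GMM these take a closed form, weighted by the per-frame component posteriors. A sketch with hypothetical shapes (T frames of dimension d, M components), illustrating the idea rather than the paper's exact kernels:

```python
import numpy as np

def gauss_diag(x, mean, var):
    # Diagonal-covariance Gaussian density for every frame in x
    norm = np.prod(2.0 * np.pi * var) ** -0.5
    return norm * np.exp(-0.5 * np.sum((x - mean) ** 2 / var, axis=-1))

def mean_derivative_features(frames, means, vars_, weights):
    # Gradient of the utterance log-likelihood w.r.t. each Gaussian mean:
    #   d/dmu_m log p(X) = sum_t gamma_m(t) (x_t - mu_m) / var_m
    # where gamma_m(t) is the posterior of component m at frame t.
    lik = np.stack([w * gauss_diag(frames, m, v)
                    for w, m, v in zip(weights, means, vars_)], axis=1)  # (T, M)
    post = lik / lik.sum(axis=1, keepdims=True)                          # gamma
    grads = [(post[:, m:m + 1] * (frames - means[m]) / vars_[m]).sum(axis=0)
             for m in range(len(weights))]
    return np.concatenate(grads)   # (M * d,) derivative feature vector
```

At a maximum-likelihood fit the gradient vanishes, so these features measure how far an utterance pulls the generative model away from its current parameters, which is what gives the derivative kernel its discriminative information.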