Interspeech 2007 Session WeD.O1: Speaker verification & identification III
Wednesday, August 29, 2007
16:00 – 18:00
Michael Wagner (University of Canberra)
A Text-Constrained Prosodic System for Speaker Verification
Elizabeth Shriberg, SRI International
Luciana Ferrer, Stanford University
We describe four improvements to a prosody SVM system, including a new method based on text- and part-of-speech-constrained prosodic features. The improved system shows remarkably good performance on NIST SRE06 data, reducing the error rate of an MLLR system by as much as 23% after combination. In addition, an N-best system analysis using eight systems reveals that the prosody SVM is the third and second most important system for 1- and 8-side training conditions, respectively, providing more complementary information than other state-of-the-art cepstral systems. We conclude that as cepstral systems continue to improve, it should become only more important to develop systems based on higher-level features.
Fusing Acoustic, Phonetic and Data-Driven Systems for Text-Independent Speaker Verification
Asmaa El Hannani, University of Fribourg, Fribourg, Switzerland
Dijana Petrovska-Delacretaz, Institut National des Télécommunications, Evry, France
This paper describes our recent efforts in exploring data-driven high-level features and their combination with low-level spectral features for speaker verification. In particular, we compare the phonetic and data-driven approaches and study their complementarity with the short-term acoustic approach. Our objective is to show that data-driven units, automatically acquired from the speech data, can be used like phonemes to extract high-level features and to bring complementary speaker-specific information that can therefore provide improvements when fused with acoustic systems. Results obtained on the NIST 2006 Speaker Recognition Evaluation data show that the combination of the phonetic, data-driven and Gaussian Mixture Model (GMM) systems brings a 27% relative reduction of the EER in comparison to the baseline GMM system.
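For readers unfamiliar with the metric, the arithmetic behind a relative EER reduction such as the 27% quoted above can be sketched as follows. The abstract does not give absolute error rates, so the two values below are hypothetical and chosen only to illustrate the calculation; the function name is likewise an assumption.

```python
def relative_reduction(baseline_eer: float, fused_eer: float) -> float:
    """Return the reduction of fused_eer relative to baseline_eer (0.27 means 27%)."""
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return (baseline_eer - fused_eer) / baseline_eer


# Hypothetical error rates, NOT taken from the paper.
baseline = 0.10   # assumed EER of a baseline GMM system
fused = 0.073     # assumed EER after fusion

print(f"relative EER reduction: {relative_reduction(baseline, fused):.0%}")
```

Note that a relative reduction is distinct from an absolute one: in this made-up case the absolute improvement is 2.7 percentage points.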
Continuous prosodic features and formant modeling with joint factor analysis for speaker verification
Najim Dehak, Centre de Recherche Informatique de Montreal
Patrick Kenny, Centre de Recherche Informatique de Montreal
Pierre Dumouchel, Centre de Recherche Informatique de Montreal
In this paper, we introduce the use of formant contours together with prosodic contours based on pitch and energy for speaker recognition. These contours are modeled in a continuous manner using Legendre polynomials (LP) over basic units that represent syllables. The parameters extracted from the LP coefficients, plus the syllable durations, are modeled with factor analysis (FA). The results obtained on the core condition of NIST 2006 SRE show that using formants with prosodic information gives an absolute improvement of approximately 3% in equal error rate (EER) compared with the results obtained with prosodic information alone. However, when the formant and prosodic system scores are fused with the cepstral FA system, we obtain results equivalent to those obtained when fusing the prosodic FA system with the same cepstral system. This fusion gives a relative improvement of 8.0% (all trials) and 12.0% (English only) in EER compared to the cepstral system alone.
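The idea of summarizing a syllable-level contour by Legendre polynomial coefficients can be sketched with NumPy as below. This is a minimal illustration, not the authors' implementation: the polynomial degree, the synthetic pitch contour, the frame-count duration, and the function name are all assumptions made for the example.

```python
import numpy as np
from numpy.polynomial import legendre


def contour_features(contour, degree=3):
    """Least-squares fit of Legendre polynomials to one contour.

    The time axis is rescaled to [-1, 1], the interval on which
    Legendre polynomials are orthogonal. Returns degree + 1 coefficients.
    """
    contour = np.asarray(contour, dtype=float)
    if contour.size <= degree:
        raise ValueError("need more samples than the polynomial degree")
    t = np.linspace(-1.0, 1.0, contour.size)
    return legendre.legfit(t, contour, degree)


# Hypothetical pitch contour (Hz) for one syllable, built from known
# Legendre terms: 120 * P0 + 15 * P1 - 10 * P2, with P2(t) = 1.5 t^2 - 0.5.
t = np.linspace(-1.0, 1.0, 20)
pitch = 120.0 + 15.0 * t - 10.0 * (1.5 * t**2 - 0.5)

coeffs = contour_features(pitch, degree=3)

# Append the syllable duration (here simply the number of frames, an assumption).
feature_vector = np.append(coeffs, pitch.size)
print(np.round(feature_vector, 3))
```

In this reading, the low-order coefficients correspond roughly to the mean, slope, and curvature of the contour, which is why a handful of coefficients can describe its overall shape.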
Loquendo - Politecnico di Torino's 2006 NIST Speaker Recognition Evaluation System
Claudio Vair, Loquendo
Daniele Colibro, Loquendo
Fabio Castaldo, Politecnico di Torino
Emanuele Dalmasso, Politecnico di Torino
Pietro Laface, Politecnico di Torino
This paper describes the Loquendo – Politecnico di Torino system evaluated on the 2006 NIST speaker recognition evaluation dataset. This system was among the best participants in this evaluation. It combines the results of two independent GMM systems: a Phonetic GMM and a classical GMM. Both systems rely on an intersession variation compensation approach performed in the feature domain, which allowed a 30% error rate reduction with respect to our 2005 system. The linear combination of the two GMM engines gives a further 10% error rate reduction. We also report the results of a set of post-evaluation experiments related to the training data for the intersession variation evaluation, for both the telephone and microphone datasets. The approach adopted for the two-wire tests is also described, showing the effect of the speaker segmentation component of our system. Finally, we describe how we performed the incremental unsupervised adaptation tests.
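A linear combination of the scores from two systems, as mentioned above, might look like the following sketch. The abstract does not state the weights or any score normalization, so the equal weighting, the example scores, and the function name are assumptions for illustration only.

```python
import numpy as np


def fuse_scores(scores_a, scores_b, weight_a=0.5):
    """Weighted sum of per-trial scores from two systems.

    weight_a is the weight of the first system; the second system
    receives 1 - weight_a. Both score arrays must cover the same trials.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("score arrays must have the same shape")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return weight_a * a + (1.0 - weight_a) * b


# Hypothetical scores for three trials from two GMM systems.
phonetic_gmm = [1.2, -0.4, 0.3]
classical_gmm = [0.8, -1.0, 0.9]
print(fuse_scores(phonetic_gmm, classical_gmm, weight_a=0.5))
```

In practice the weight would typically be tuned on held-out data, and the scores of each system brought to a comparable range before summing.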
A Straightforward and Efficient Implementation of the Factor Analysis Model for Speaker Verification
Driss Matrouf, LIA
Nicolas Scheffer, LIA
Benoit Fauve, UWS
Jean-François Bonastre, LIA
For the past few years, the problem of session variability in text-independent automatic speaker verification has been actively tackled. A new paradigm based on a factor analysis model has been successfully applied to this task. While very efficient, its implementation is demanding. In this paper, the algorithms involved in the eigenchannel MAP model are written down for a straightforward implementation, without referring to previous work or complex mathematics. In addition, a different compensation scheme is proposed where the standard GMM likelihood can be used without any modification to obtain good performance (even without the need for score normalization). The use of the compensated supervectors within an SVM classifier through a distance-based kernel is also investigated. Experimental results show an overall 50% relative gain over the standard GMM-UBM system on NIST SRE 2005 and 2006 protocols (both at the DCFmin and EER).
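The general idea of removing a channel component from a supervector can be illustrated with a heavily simplified sketch. It is not the algorithm from the paper: it assumes a known low-rank channel matrix U, a standard normal prior on the channel factors, and identity covariance with unit occupation counts, so the MAP estimate reduces to a ridge-regression form. All names and values are hypothetical.

```python
import numpy as np


def compensate_supervector(s, m, U):
    """Remove an estimated channel offset U @ x from supervector s.

    Simplifying assumptions (not from the paper): identity covariance,
    unit occupation counts, and a standard normal prior on x, giving
    x = (I + U^T U)^{-1} U^T (s - m).
    """
    s = np.asarray(s, dtype=float)
    m = np.asarray(m, dtype=float)
    U = np.asarray(U, dtype=float)
    rank = U.shape[1]
    x = np.linalg.solve(np.eye(rank) + U.T @ U, U.T @ (s - m))
    return s - U @ x, x


# Toy example: 2-dimensional supervector, rank-1 channel subspace.
m = np.zeros(2)                 # assumed UBM mean supervector
U = np.array([[1.0], [0.0]])    # assumed eigenchannel matrix
s = np.array([2.0, 5.0])        # observed session supervector

compensated, x = compensate_supervector(s, m, U)
print(compensated, x)
```

Only the direction spanned by U is altered; the component of the supervector outside the channel subspace is left untouched.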
Multi-Modal User Authentication from Video for Mobile or Variable-Environment Applications
Timothy J. Hazen, Massachusetts Institute of Technology
Daniel Schultz, Massachusetts Institute of Technology
In this study, we apply a combination of face and speaker identification techniques to the task of multi-modal (i.e., multi-biometric) user authentication for mobile or variable-environment applications. Audio-visual data was collected using a web camera connected to a laptop computer in three different environments: a quiet indoor office, a busy indoor cafe, and near a noisy outdoor street intersection. Experiments demonstrated the benefits that may be obtained from using a multi-modal approach, even when both input modalities suffer from difficult environmental conditions or a poor match between training and testing conditions. Over twelve different training and testing conditions, user authentication equal error rates were reduced by an average of 19% from the best individual biometric in each condition, and by 36% from an audio-only system.
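Several abstracts in this session report results as equal error rates. As a rough sketch of how an EER can be estimated from trial scores, the code below sweeps a decision threshold over the observed scores and picks the point where the false acceptance and false rejection rates are closest. The scores are hypothetical, and this simple sweep is only one of several common ways to estimate the EER.

```python
import numpy as np


def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is at or above the threshold.
    """
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, impostor]))
    # False rejection rate: target trials scoring below the threshold.
    frr = np.array([(target < th).mean() for th in thresholds])
    # False acceptance rate: impostor trials scoring at or above the threshold.
    far = np.array([(impostor >= th).mean() for th in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


# Hypothetical authentication scores, not data from any paper above.
genuine = [2.1, 1.7, 0.4, 2.5, 1.1]
impostors = [-0.3, 0.6, -1.2, 0.1, -0.8]
print(f"EER: {equal_error_rate(genuine, impostors):.1%}")
```

With few trials the estimate is coarse, since the error rates can only change in steps of one trial.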