Improved location features for meeting speaker diarization
Scott Otterson, University of Washington
This paper proposes several methods to improve the correlation-based location features recently used in meeting speaker diarization. A speech-specific alternative to the generalized cross correlation phase transform (GCC-PHAT) algorithm is tested and shown to provide equal or better results without noise reduction or continuity-enforcing smoothing. The limitations of a single correlation reference waveform are discussed, and it is shown how a multi-band energy ratio feature can help overcome them, yielding significantly improved performance. An all-pairs correlation is also proposed, and when combined with energy ratios, it also improves upon the baseline system. However, the best combination is the baseline correlation features with energy ratios.
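For readers unfamiliar with the baseline, the following is a minimal sketch of standard GCC-PHAT delay estimation between one microphone signal and a reference channel. It is not the speech-specific alternative the abstract proposes, and the function name, parameters, and the small regularization constant are illustrative assumptions.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs, max_tau=None):
    """Estimate the delay (in seconds) of `sig` relative to `ref` with GCC-PHAT.

    Hypothetical helper for illustration; `fs` is the sampling rate in Hz and
    `max_tau` optionally limits the search to physically plausible delays.
    """
    n = len(sig) + len(ref)              # zero-pad to avoid circular wrap-around
    sig_f = np.fft.rfft(sig, n=n)
    ref_f = np.fft.rfft(ref, n=n)
    cross = sig_f * np.conj(ref_f)       # cross-power spectrum
    cross /= np.abs(cross) + 1e-12       # PHAT weighting: discard magnitude, keep phase
    cc = np.fft.irfft(cross, n=n)        # generalized cross correlation

    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Reorder so that array index `max_shift` corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs
```

A positive return value means `sig` lags `ref`. In a diarization front end, one such delay per microphone pair and analysis frame would form the location feature vector.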
A Robust Stopping Criterion for Agglomerative Hierarchical Clustering in a Speaker Diarization System
Kyu Han, Univ. of Southern California
Shrikanth Narayanan, Univ. of Southern California
Agglomerative hierarchical clustering (AHC) is an unsupervised classification strategy of merging the closest pair of clusters recursively, and has been widely used in speaker diarization systems to classify speech segments by speaker identity. The most critical part in AHC is how to automatically stop the recursive process at the point when clustering error rate reaches its lowest possible value, for which a BIC-based stopping criterion has been widely used. However, this criterion is not robust to data source variation. In this paper, we examine the criterion to establish the cause for the robustness issue and, based on this, propose an improved stopping criterion. Experimental results based on meeting conversation excerpts randomly chosen from various meeting speech corpora indicate that the proposed criterion is superior to the BIC-based one, showing that clustering error rate is improved on average by 7.28% (absolute) and 34.16% (relative).
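As a point of reference, the conventional BIC-based stopping rule that the abstract criticizes can be sketched as follows. This assumes each cluster is modeled by a single full-covariance Gaussian and uses the common sign convention in which a negative delta-BIC favors merging. The function names and the penalty weight `lam` are illustrative assumptions, and the improved criterion proposed by the authors is not shown.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Delta-BIC between two clusters of feature frames (rows = frames).

    Negative values favor modeling x and y with one Gaussian (merge);
    positive values favor two separate Gaussians (keep apart).
    """
    n_x, n_y = len(x), len(y)
    n = n_x + n_y
    d = x.shape[1]

    def logdet(frames):
        cov = np.cov(frames, rowvar=False, bias=True)  # ML covariance estimate
        return np.linalg.slogdet(cov)[1]

    # Penalty for the extra mean vector and covariance matrix of a second model.
    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n)
    gain = 0.5 * (n * logdet(np.vstack((x, y))) - n_x * logdet(x) - n_y * logdet(y))
    return gain - penalty

def ahc_bic(segments, lam=1.0):
    """Merge the closest pair repeatedly; stop when no pair has delta-BIC < 0."""
    clusters = [np.asarray(s, dtype=float) for s in segments]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = delta_bic(clusters[i], clusters[j], lam)
                if best is None or score < best[0]:
                    best = (score, i, j)
        score, i, j = best
        if score >= 0:                      # stopping criterion
            break
        merged = np.vstack((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

Because the stopping point depends on the penalty weight `lam`, which is usually tuned on development data, this rule tends to be sensitive to changes in data source, consistent with the robustness issue the abstract describes.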
The Blame Game: Performance Analysis of Speaker Diarization System Components
Marijn Huijbregts, University of Twente, International Computer Science Institute
Chuck Wooters, International Computer Science Institute
In this paper we discuss the performance analysis of a speaker diarization system similar to the system that was submitted by ICSI at the NIST RT06s evaluation benchmark. The analysis, which is based on a series of oracle experiments, provides a good understanding of the performance of each system component on a test set of twelve conference meetings used in previous NIST benchmarks. Our analysis shows that the speech activity detection component contributes most to the total diarization error rate (23%). The inability to model overlapping speech is also a large source of errors (22%), followed by the component that creates the initial system models (15%).
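For context, the diarization error rate is conventionally the sum of missed speech, false alarm speech, and speaker confusion time, divided by the total reference speech time. The sketch below illustrates that definition only. The function name and the durations in the usage note are hypothetical and are not taken from the paper.

```python
def diarization_error_rate(missed, false_alarm, speaker_error, total_speech):
    """Return DER as a fraction, given error durations in seconds.

    missed:        reference speech not labeled as speech
    false_alarm:   non-speech labeled as speech
    speaker_error: speech attributed to the wrong speaker
    total_speech:  total reference speaker time (the scoring denominator)
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + speaker_error) / total_speech
```

For example, with hypothetical durations of 30 s missed, 20 s false alarm, and 50 s speaker confusion over 1000 s of reference speech, the function returns a DER of 10%. An oracle experiment in the sense of the abstract would replace one component with ground truth and measure how much this figure drops.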
Trainable Speaker Diarization
Hagai Aronowitz, IBM T.J. Watson Research Center
This paper presents a novel framework for speaker diarization. We explicitly model intra-speaker inter-segment variability using a speaker-labeled training corpus and use this modeling to assess the speaker similarity between speech segments. Modeling is done by embedding segments into a segment-space using kernel-PCA, followed by explicit modeling of speaker variability in the segment-space. Our framework leads to a significant improvement in diarization accuracy. Finally, we present a similar method for bandwidth classification.
Improving Speaker Diarization for CHIL Lecture Meetings
Jing Huang, IBM
Etienne Marcheret, IBM
Karthik Visweswariah, IBM
Speaker diarization is often performed before automatic speech recognition (ASR) to label speaker segments. In this paper we present two simple schemes to improve speaker diarization performance. The first is to iteratively refine GMM speaker models by frame-level re-labeling and smoothing of the decision likelihood. The second is to use word-level alignment information from the ASR process. We focus on the CHIL lecture meeting data. Our experiments on the NIST RT06 evaluation data show that these simple methods are quite effective in improving our baseline diarization system, with alignment information providing a 1% absolute reduction in diarization error rate (DER) and the re-label smoothing providing an additional 3.51% absolute reduction in DER. The overall system achieves a DER that is a 6% relative improvement over the top-performing system from the RT06 evaluation.
Speaker Diarization using Normalized Cross Likelihood Ratio
Viet-Bac Le, LORIA, Nancy, France
Odile Mella, LORIA, Nancy, France
Dominique Fohr, LORIA, Nancy, France
In this paper, we present the Normalized Cross Likelihood Ratio (NCLR) and the advantages of using it in a speaker diarization system. First, the NCLR is used as a dissimilarity measure between two Gaussian speaker models in the speaker change detection step, and its contribution to the performance of speaker change detection is compared with those of the BIC and Hotelling’s T2-statistic measures. Then, the NCLR measure is modified to deal with multi-Gaussian adapted models in the cluster recombination step. This step ends the step-by-step speaker diarization process, after the BIC-based hierarchical clustering and the Viterbi re-segmentation steps. Compared with the CLR (Cross Likelihood Ratio) measure, the NCLR measure reduces diarization error by more than 30% relative on the ESTER evaluation data.
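The abstract does not give the NCLR formula. The sketch below assumes a symmetric form in which each segment's length-normalized log-likelihood under its own Gaussian is compared with its log-likelihood under the other segment's Gaussian. That form, the function names, and the small covariance regularizer are assumptions made for illustration and may differ from the authors' definition.

```python
import numpy as np

def fit_gaussian(x):
    """Fit a single full-covariance Gaussian to frames `x` (rows = frames)."""
    mean = x.mean(axis=0)
    # Small diagonal term (an assumption) keeps the covariance invertible.
    cov = np.cov(x, rowvar=False, bias=True) + 1e-6 * np.eye(x.shape[1])
    return mean, cov

def avg_loglik(x, model):
    """Average per-frame log-likelihood of `x` under a Gaussian (mean, cov)."""
    mean, cov = model
    d = x.shape[1]
    diff = x - mean
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    logdet = np.linalg.slogdet(cov)[1]
    return float(np.mean(-0.5 * (maha + logdet + d * np.log(2 * np.pi))))

def nclr(x_i, x_j):
    """Assumed NCLR form: larger values suggest two different speakers."""
    m_i, m_j = fit_gaussian(x_i), fit_gaussian(x_j)
    return ((avg_loglik(x_i, m_i) - avg_loglik(x_i, m_j))
            + (avg_loglik(x_j, m_j) - avg_loglik(x_j, m_i)))
```

Under this assumed form the measure is zero when both segments are identical and grows as the two models diverge, so a speaker change could be hypothesized wherever the value between adjacent windows exceeds a tuned threshold.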