Interspeech 2007 Session ThB.SS: Speech recognition by automatic attribute transcription
Thursday, August 30, 2007
10:00 – 12:00
Astrid Scala 1
Chin-Hui Lee (Georgia Tech)
An Overview on Automatic Speech Attribute Transcription (ASAT)
Chin-Hui Lee, Georgia Institute of Technology
Mark Clements, Georgia Institute of Technology
Sorin Dusan, Rutgers University
Eric Fosler-Lussier, Ohio State University
Keith Johnson, University of California, Berkeley
Biing-Hwang Juang, Georgia Institute of Technology
Lawrence Rabiner, Rutgers University
Automatic Speech Attribute Transcription (ASAT), an ITR project sponsored by NSF under grant IIS-04-27113, is a cross-institution effort involving the Georgia Institute of Technology, The Ohio State University, the University of California at Berkeley, and Rutgers University. This project approaches speech recognition from a more linguistic perspective: unlike traditional ASR systems, humans detect acoustic and auditory cues, weigh and combine them to form theories, and then process these cognitive hypotheses until linguistically and pragmatically consistent speech understanding is achieved. A major goal of the ASAT paradigm is to develop a detection-based approach to automatic speech recognition (ASR) built on attribute detection and knowledge integration. We report on the progress of the ASAT project, present a sharable platform for community collaboration, and highlight areas of potential interdisciplinary ASR research.
Detection-Based ASR in the Automatic Speech Attribute Transcription Project
Ilana Bromberg, Ohio State University, CSE Dept.
Qiang Fu, Georgia Tech, ECE Dept.
Jun Hou, Rutgers University
Jinyu Li, Georgia Tech, ECE Dept.
Chengyuan Ma, Georgia Tech, ECE Dept.
Brett Matthews, Georgia Tech, ECE Dept.
Antonio Moreno-Daniel, Georgia Tech, ECE Dept.
Jeremy Morris, Ohio State University, CSE Dept.
Sabato Marco Siniscalchi, Georgia Tech, ECE Dept.
Yu Tsao, Georgia Tech, ECE Dept.
Yu Wang, Ohio State University
We present methods of detector design in the Automatic Speech Attribute Transcription project. This paper details the results of a student-led, cross-site collaboration between the Georgia Institute of Technology, Ohio State University, and Rutgers University. The work reported in this paper describes and evaluates the detection-based ASR paradigm and discusses phonetic attribute classes, methods of detecting framewise phonetic attributes, and methods of combining attribute detectors for ASR. We use Multi-Layer Perceptrons, Hidden Markov Models, and Support Vector Machines to compute confidence scores for several prescribed sets of phonetic attribute classes. We use Conditional Random Fields (CRFs) and lattice rescoring to combine framewise detection scores for continuous phone recognition on the TIMIT database. With CRFs incorporating all of the attribute detectors discussed in the paper, we achieve a phone accuracy of 70.63%, outperforming both the baseline and the enhanced HMM systems.
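As a rough illustration of the combination step, the sketch below decodes a best phone sequence from framewise attribute-detector confidences by Viterbi search over linear-chain potentials, the inference at the heart of a CRF. All scores, weights, and dimensions here are synthetic placeholders, not the project's actual detectors or learned parameters.

```python
import numpy as np

# Hypothetical toy setup: 5 frames, 4 attribute-detector scores per frame,
# 3 phone classes. In a real CRF, W and trans would be learned.
rng = np.random.default_rng(0)
attr_scores = rng.random((5, 4))                         # framewise attribute confidences
W = rng.random((3, 4))                                   # phone x attribute weights
trans = np.log(np.full((3, 3), 0.2) + 0.4 * np.eye(3))   # favor self-transitions

def viterbi(emissions, trans):
    """Best state sequence given per-frame node scores and transition scores."""
    T, S = emissions.shape
    delta = emissions[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + trans        # score of moving prev -> cur state
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + emissions[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

emissions = attr_scores @ W.T                # linear-chain node potentials
best_path = viterbi(emissions, trans)        # one phone index per frame
```

Lattice rescoring would instead re-weight competing hypotheses with these combined scores; the Viterbi pass above shows only the first-best decoding case.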
Attribute-based Mandarin Speech Recognition using Conditional Random Fields
Chi-Yueh Lin, Department of Electrical Engineering, National Tsing-Hua University, Hsinchu, Taiwan
Hsiao-Chuan Wang, Department of Electrical Engineering, National Tsing-Hua University, Hsinchu, Taiwan
Integrating phonetic knowledge into a speech recognizer is a possible way to further improve the performance of conventional HMM-based speech recognition methods. This paper presents a cascaded architecture that combines attribute detection with a conditional random field to exploit phonetic knowledge within the phone decoding process. The attribute detection can be implemented with any effective feature extraction approach; in this study, an HMM-based method is applied for attribute tagging of Mandarin speech. A conditional random field that takes the attribute labels as input vectors then performs the speech recognition. Preliminary experimental results show that the proposed method is promising and worthy of further investigation.
Comparing classifiers for pronunciation error detection
Helmer Strik, CLST, Department of Linguistics, Radboud University, Nijmegen, The Netherlands
Khiet Truong, TNO Human Factors, Soesterberg, The Netherlands
Febe de Wet, SU-CLaST, Stellenbosch University, South Africa
Catia Cucchiarini, CLST, Department of Linguistics, Radboud University, Nijmegen, The Netherlands
Providing feedback on pronunciation errors in computer-assisted language learning systems requires that pronunciation errors be detected automatically. In the present study we compare four types of classifiers that can be used for this purpose: two acoustic-phonetic classifiers (one of which employs linear discriminant analysis (LDA)), a classifier based on cepstral coefficients in combination with LDA, and one based on confidence measures (the so-called Goodness Of Pronunciation scores). The best results were obtained with the two LDA classifiers, which produced accuracy levels of about 85-93%.
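As a minimal sketch of the LDA-based approach, a two-class Fisher discriminant can be built directly from class statistics. The Gaussian feature clouds below are synthetic stand-ins for the paper's acoustic-phonetic and cepstral features.

```python
import numpy as np

# Synthetic 3-dimensional features for correctly pronounced vs. mispronounced
# tokens (illustrative only; not the study's data).
rng = np.random.default_rng(1)
correct = rng.normal(0.0, 1.0, (100, 3))
errors = rng.normal(2.0, 1.0, (100, 3))

mu0, mu1 = correct.mean(axis=0), errors.mean(axis=0)
Sw = np.cov(correct, rowvar=False) + np.cov(errors, rowvar=False)  # within-class scatter
w = np.linalg.solve(Sw, mu1 - mu0)         # Fisher discriminant direction
threshold = w @ (mu0 + mu1) / 2            # midpoint decision boundary

def is_error(x):
    """Flag a token as mispronounced if its projection exceeds the threshold."""
    return float(w @ x) > threshold

acc = np.mean([not is_error(x) for x in correct] + [is_error(x) for x in errors])
```

In a real system the decision threshold would be tuned on held-out data to trade off false alarms against missed errors, since unwarranted corrective feedback is costly for learners.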
Using Prosodic And Spectral Characteristics For Sleepiness Detection
Jarek Krajewski, Work and Organizational Psychology
Bernd Kroeger, Clinic of Phoniatrics, Paedaudiology, and Communication Disorders, University Hospital Aachen and Aachen University
This paper describes a promising sleepiness detection approach based on prosodic and spectral speech characteristics and illustrates the validity of this method by briefly discussing results from a sleep deprivation study (N=20) with a within-subject design. During the night of sleep deprivation, a self-report scale was administered every hour, just before the recordings, to determine the sleepiness state. The speech material consisted of simulated driver assistance system phrases. To investigate sleepiness-induced speech changes, a standard set of spectral and prosodic features was extracted from the sentences. After forward selection and a PCA were applied to the feature space, LDA- and ANN-based classification models were trained. The best level-0 model (RA15, LDA) achieves a mean recognition rate of 80.0% on the two-class problem. Using an ensemble classification strategy (majority voting as meta-classifier), we achieved a recognition rate of 88.2%.
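The meta-classification stage can be sketched as a simple majority vote over the level-0 model outputs. The labels and the three-model setup below are hypothetical, chosen only to illustrate the voting rule.

```python
from collections import Counter

# Majority-voting meta-classifier over level-0 predictions (illustrative;
# the study's level-0 models were LDA- and ANN-based classifiers).
def majority_vote(predictions):
    """Return the label predicted by the most level-0 classifiers."""
    return Counter(predictions).most_common(1)[0][0]

votes = ["sleepy", "alert", "sleepy"]   # three hypothetical level-0 outputs
decision = majority_vote(votes)         # -> "sleepy"
```

An odd number of voters avoids ties on a two-class problem, which is one reason simple voting works well as a level-1 combiner.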
Score Fusion for Articulatory Feature Detection
Brian Ore, General Dynamics Advanced Information Systems
Raymond Slyh, AFRL
Articulatory Features (AFs) describe the way in which the speech organs are used when producing speech sounds. Research has shown that incorporating this information into speech recognizers can lead to an increase in system performance. This paper considers English AF detection using Gaussian Mixture Models (GMMs) and Multi-Layer Perceptrons (MLPs). The scores from the GMM- and MLP-based detectors are fused using a second MLP, resulting in an average reduction of 8.24% in equal error rate compared to the individual systems. These detector outputs are used to form the feature set for a Hidden Markov Model (HMM) phone recognizer. It is shown that monophone models created using the proposed feature set perform comparably to triphone models trained using Mel-Frequency Cepstral Coefficients (MFCCs).
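A toy version of the score-fusion step might look like the following: a one-hidden-layer MLP, written in plain numpy and trained by gradient descent, learns to combine two noisy detector scores for a single articulatory feature into one decision. The synthetic scores, layer sizes, learning rate, and iteration count are all illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Synthetic per-frame scores from two detectors for one AF (label 1 = AF present).
rng = np.random.default_rng(2)
n = 400
labels = rng.integers(0, 2, n).astype(float)
gmm_scores = labels + rng.normal(0, 0.6, n)     # noisy stand-in for GMM detector
mlp_scores = labels + rng.normal(0, 0.6, n)     # noisy stand-in for MLP detector
X = np.stack([gmm_scores, mlp_scores], axis=1)

# Tiny fusion MLP: 2 inputs -> 4 tanh hidden units -> 1 sigmoid output.
W1 = rng.normal(0, 0.5, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 0.5, 4);      b2 = 0.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(2000):                           # plain full-batch gradient descent
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    grad_out = (p - labels) / n                 # d(cross-entropy)/d(output logit)
    W2 -= 0.5 * h.T @ grad_out; b2 -= 0.5 * grad_out.sum()
    grad_h = np.outer(grad_out, W2) * (1 - h**2)   # backprop through tanh
    W1 -= 0.5 * X.T @ grad_h;   b1 -= 0.5 * grad_h.sum(axis=0)

fused_acc = np.mean((p > 0.5) == labels)        # fused detector accuracy
```

Because the two detectors make partly independent errors, the fused score is more reliable than either input alone, which is the intuition behind the reported equal-error-rate reduction.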