Interspeech 2007 Session TuC.P2b: Education and training
Type
poster
Date
Tuesday, August 28, 2007
Time
13:30 – 15:30
Room
Alpaerts
Chair
Mari Ostendorf (University of Washington)
TuC.P2b‑1
Syllable Lattices as a Basis for a Children’s Speech Reading Tracker
Daniel Bolanos, PhD student at CSLR
Wayne Ward, Research Professor and director of the CSLR
Sarel Van Vuuren, Research Associate at the CSLR
Javier Garrido, Associate Professor
In this paper we present an algorithm that uses information contained in syllable lattices to significantly reduce the classification error rate of a children's speech reading tracker. The task is to verify whether each word in a reference string was actually spoken. A syllable graph is generated from the reference word string to represent acceptable pronunciation alternatives, and a syllable-based continuous speech recognizer is used to generate a syllable lattice. The best alignment between the reference graph and the syllable lattice is determined with a dynamic programming algorithm. The speech vectors aligned with each syllable are used as features for Support Vector Machine classifiers that accept or reject each syllable in the aligned path. Experimental results on three children's speech corpora show that this algorithm substantially reduces the classification error rate relative to the standard word-based tracker and to a simple best-path syllable-based tracker.
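A minimal sketch of the kind of alignment step described in this abstract, simplified to aligning a flat reference syllable sequence against a single recognized syllable sequence (the paper aligns a reference graph against a full lattice). The syllable strings and cost values are hypothetical.

```python
# Hedged sketch: dynamic-programming alignment of a reference syllable
# sequence against a recognized syllable sequence (a simplification of the
# graph-vs-lattice alignment described in the abstract). Syllables and
# costs here are illustrative only.

def align_syllables(reference, hypothesis, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = minimum cost of aligning reference[:i] with hypothesis[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * del_cost
    for j in range(1, m + 1):
        dp[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0.0 if reference[i - 1] == hypothesis[j - 1] else sub_cost
            dp[i][j] = min(dp[i - 1][j - 1] + match,   # match / substitution
                           dp[i - 1][j] + del_cost,    # reference syllable missing
                           dp[i][j - 1] + ins_cost)    # extra syllable inserted

    # Backtrace to recover the aligned syllable pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        sub = 0.0 if (i > 0 and j > 0 and reference[i - 1] == hypothesis[j - 1]) else sub_cost
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + sub:
            pairs.append((reference[i - 1], hypothesis[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + del_cost:
            pairs.append((reference[i - 1], None)); i -= 1
        else:
            pairs.append((None, hypothesis[j - 1])); j -= 1
    return list(reversed(pairs))

# Example with made-up syllable strings:
print(align_syllables(["kae", "t", "sae", "t"], ["kae", "sae", "t"]))
```

In the paper itself, the aligned speech vectors for each syllable would then be passed to per-syllable SVM classifiers; that stage is not sketched here.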
TuC.P2b‑2
Mandarin Vowel Pronunciation Quality Evaluation by Using Formant Pattern Recognition
Fuping Pan, ThinkIT laboratory, Institute of Acoustics, Chinese Academy of Sciences
Qingwei Zhao, ThinkIT laboratory, Institute of Acoustics, Chinese Academy of Sciences
Yonghong Yan, ThinkIT laboratory, Institute of Acoustics, Chinese Academy of Sciences
In this paper we propose applying formant pattern recognition to Mandarin vowel pronunciation assessment. We devise a novel pitch-cycle detection method and estimate formant frequencies from frequency-domain observations using pitch-synchronous analysis. Statistical classifiers are trained to discriminate formant patterns for vowel pronunciation assessment. Five confusable Mandarin vowels are selected for the experiments. The results show that the new method improves the average human-machine score correlation by 6.10% over an ASR-based technique and by 6.37% over the traditional LPC-based analysis method.
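A hedged sketch of the classification stage only: a simple per-vowel Gaussian classifier over (F1, F2) formant pairs, standing in for the statistical formant-pattern classifiers mentioned above. The formant extraction itself (pitch-synchronous analysis) is not shown, and all values are illustrative.

```python
# Hedged sketch: a per-vowel Gaussian classifier over (F1, F2) formant pairs.
# Real features would come from the pitch-synchronous analysis described in
# the paper; the synthetic data below is made up for illustration.
import numpy as np

def fit_vowel_models(training_data):
    """training_data: {vowel_label: array of shape (N, 2) with (F1, F2) in Hz}."""
    models = {}
    for vowel, feats in training_data.items():
        mean = feats.mean(axis=0)
        cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(2)  # regularize
        models[vowel] = (mean, cov)
    return models

def log_likelihood(x, mean, cov):
    diff = x - mean
    return -0.5 * (diff @ np.linalg.inv(cov) @ diff
                   + np.log(np.linalg.det(cov)) + 2 * np.log(2 * np.pi))

def classify(x, models):
    return max(models, key=lambda v: log_likelihood(x, *models[v]))

# Illustrative use with synthetic formant data for two vowels:
rng = np.random.default_rng(0)
data = {"a": rng.normal([850, 1200], 60, size=(50, 2)),
        "i": rng.normal([300, 2300], 60, size=(50, 2))}
models = fit_vowel_models(data)
print(classify(np.array([320, 2250]), models))   # expected: "i"
```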
TuC.P2b‑3
Automatic Detection and Classification of Disfluent Reading Miscues in Young Children's Speech for the Purpose of Assessment
Matthew Black, Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, CA, USA
Joseph Tepperman, Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, CA, USA
Sungbok Lee, Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, CA, USA
Patti Price, PPrice Speech and Language Technology, Menlo Park, CA, USA
Shrikanth Narayanan, Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, CA, USA
This paper explores the importance of disfluent reading miscues (sounding-out, hesitations, whispering, elongated onsets, question intonations) in automating the assessment of children’s oral word-reading tasks. Analysis showed that a significant portion (21%) of the speech, obtained from children in grades K-2 from predominantly Spanish-speaking families, contained at least one disfluent reading miscue. We found that human evaluators rated fluency nearly as important as accuracy when judging a child’s overall reading ability. We devised a lexical method for automatically detecting the sounding-out, hesitation, and whispering disfluencies, which achieved a 14.9% missed-detection rate and an 8.9% false-alarm rate. We were also able to discriminate 69.4% of the sound-outs from other disfluencies with a 28.5% false-alarm rate, a promising and novel result.
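For readers unfamiliar with the two metrics quoted above, a hedged sketch of how missed-detection and false-alarm rates are computed for a binary disfluency detector; the reference and hypothesis labels are invented for illustration.

```python
# Hedged sketch: missed-detection and false-alarm rates for a binary
# disfluency detector. The labels below are illustrative only.

def detection_error_rates(reference, hypothesis):
    """reference/hypothesis: per-token booleans, True = disfluency present."""
    misses = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    false_alarms = sum(1 for r, h in zip(reference, hypothesis) if not r and h)
    n_positive = sum(reference)
    n_negative = len(reference) - n_positive
    miss_rate = misses / n_positive if n_positive else 0.0
    fa_rate = false_alarms / n_negative if n_negative else 0.0
    return miss_rate, fa_rate

ref = [True, False, False, True, True, False, False, True]
hyp = [True, False, True, True, False, False, False, True]
print(detection_error_rates(ref, hyp))  # (0.25, 0.25)
```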
TuC.P2b‑4
Structural Assessment of Language Learners' Pronunciation
Nobuaki Minematsu, The University of Tokyo
Kei Kamata, The University of Tokyo
Satoshi Asakawa, The University of Tokyo
Takehiko Makino, Chuo University
Tazuko Nishimura, The University of Tokyo
Keikichi Hirose, The University of Tokyo
A speaker-invariant structural representation of speech has been proposed in which only the phonic contrasts between speech sounds are extracted to form their external structure. Given a mapping function between speaker A's acoustic space and speaker B's, the speech dynamics can be shown mathematically to be invariant between the two. This structural, dynamic representation is applied here to describing pronunciation: because non-linguistic factors are removed, it focuses purely on non-nativeness. For vowel learning, the vowels each learner should correct first are estimated automatically. Unlike the conventional approach, this estimation is made without direct use of acoustic substance such as spectra. In this paper, using learners' vowel charts plotted by a phonetician, we examine the validity of this contrastive, or relative, approach by comparing it with the conventional absolute approach. The results show the high validity of our proposal.
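A hedged sketch of the general idea of a contrast-based vowel structure: each vowel is modelled as a distribution, the "structure" is the matrix of pairwise distances between vowels, and a learner is scored by comparing their structure matrix with a reference speaker's. The choice of Bhattacharyya distance, the (F1, F2) features, and the toy vowel data are assumptions made for illustration and are not taken from the paper.

```python
# Hedged sketch: a contrast-based ("structural") comparison of vowel systems.
# Each vowel is a Gaussian over (F1, F2); the structure is the matrix of
# pairwise distances, so absolute positions (and many speaker-dependent
# factors) drop out. Distance measure and data are illustrative assumptions.
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.inv(cov) @ diff
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def structure_matrix(models):
    """models: list of (mean, cov) per vowel; returns the pairwise distance matrix."""
    n = len(models)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = bhattacharyya(*models[i], *models[j])
    return d

def structural_mismatch(learner, reference):
    """Euclidean distance between the two structure (distance) matrices."""
    return np.linalg.norm(structure_matrix(learner) - structure_matrix(reference))

# Toy example: the learner's /i/ has drifted toward /e/, distorting the structure.
eye2 = np.eye(2) * 50.0
ref = [(np.array([300.0, 2300.0]), eye2),
       (np.array([500.0, 1900.0]), eye2),
       (np.array([850.0, 1200.0]), eye2)]
lrn = [(np.array([420.0, 2050.0]), eye2),
       (np.array([500.0, 1900.0]), eye2),
       (np.array([850.0, 1200.0]), eye2)]
print(structural_mismatch(lrn, ref))
```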
TuC.P2b‑5
Automatic large-scale oral language proficiency assessment
Febe de Wet, Stellenbosch University Centre for Language and Speech Technology
Christa van der Walt, Department of Curriculum Studies, Stellenbosch University
Thomas Niesler, Department of Electrical and Electronic Engineering, Stellenbosch University
We describe first results obtained during the development of an automatic system for assessing the spoken English proficiency of university students. The ultimate aim of the system is to allow fast, consistent and objective assessment of oral proficiency for the purpose of placing students in courses appropriate to their language skills. Rate of speech (ROS) was chosen as an indicator of fluency for a number of oral language exercises. In a test involving 106 student subjects, the assessments of 5 human raters are compared with evaluations based on automatically derived ROS scores. It is found that, although the ROS is estimated accurately, the correlation between human assessments and the ROS scores varies between 0.5 and 0.6. However, the results also indicate that only two of the five human raters were consistent in their appraisals, and that inter-rater agreement was only mild.
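A hedged sketch of the two quantities the abstract relies on: a simple rate-of-speech score computed from word-level alignments, and its Pearson correlation with human ratings. The ROS definition shown (words per second of speech) is one common choice, and the alignments and ratings below are invented for illustration.

```python
# Hedged sketch: a rate-of-speech (ROS) score and its Pearson correlation
# with human ratings. Timing data and ratings are invented for illustration.
import numpy as np

def rate_of_speech(word_alignments):
    """word_alignments: list of (start_sec, end_sec) for each spoken word."""
    n_words = len(word_alignments)
    speech_time = sum(end - start for start, end in word_alignments)
    return n_words / speech_time if speech_time > 0 else 0.0

# 3 words over 1.3 s of speech -> about 2.31 words per second.
print(rate_of_speech([(0.0, 0.4), (0.5, 0.9), (1.1, 1.6)]))

# One ROS score and one human rating per student (toy data).
ros_scores = np.array([2.1, 3.4, 2.8, 1.9, 3.0, 2.5])
human_ratings = np.array([2.0, 4.0, 3.5, 2.0, 3.0, 3.0])

# Pearson correlation between automatic ROS scores and human assessments.
r = np.corrcoef(ros_scores, human_ratings)[0, 1]
print(round(r, 2))
```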