Interspeech 2007
August 27-31, 2007

Antwerp, Belgium

Interspeech 2007 Session TuD.P1: Speech perception I

Type: poster
Date: Tuesday, August 28, 2007
Time: 16:00 – 18:00
Room: Foyer
Chair: Bernd Möbius (Institute of Natural Language Processing, Experimental Phonetics Group, Stuttgart)


Spoken word recognition of Chinese homophones: A further investigation
Michael C. W. Yip, School of Arts & Social Sciences, The Open University of Hong Kong

A cross-modal naming experiment was conducted to examine the effects of context and other lexical information in the processing of Chinese homophones during spoken language comprehension. In this experiment, listeners named aloud a visual probe as fast as they could, at a pre-designated point upon hearing the sentence, which ended with a spoken Chinese homophone. Results further support that prior context has an early effect on the disambiguation of various homophonic meanings, shortly after the acoustic onset of the word. Second, context interacts with the frequency of the individual meanings of a homophone during lexical access. Finally, the present pattern of results is clearly consistent with the context-dependency hypothesis that selection of the appropriate meaning of an ambiguous word depends on the simultaneous interaction of both sentential and lexical information during lexical access.

The Role of Outer Hair Cell Function in the Perception of Synthetic Versus Natural Speech
Maria Wolters, University of Edinburgh
Pauline Campbell, Queen Margaret University
Christine DePlacido, Queen Margaret University
Amy Liddell, Queen Margaret University
David Owens, Queen Margaret University

Hearing loss as assessed by pure-tone audiometry (PTA) is significantly correlated with the intelligibility of synthetic speech. However, PTA is a subjective audiological measure that assesses the entire auditory pathway and does not discriminate between the different afferent and efferent contributions. In this paper, we focus on one particular aspect of hearing that has been shown to correlate with hearing loss: outer hair cell (OHC) function. One role of OHCs is to increase sensitivity and frequency selectivity. This function of OHCs can be assessed quickly and objectively through otoacoustic emissions (OAE) testing, which is little known outside the field of audiology. We find that OHC function affects the perception of human speech, but not that of synthetic speech. This has important implications not just for audiological and electrophysiological research, but also for adapting speech synthesis to ageing ears.

Hybridizing Conversational and Clear Speech
Akiko Kusumoto, CSLU-OGI at OHSU
Alexander B. Kain, CSLU-OGI at OHSU
John-Paul Hosom, CSLU-OGI at OHSU
Jan P. H. van Santen, CSLU-OGI at OHSU

"Clear" (CLR) speech is a speaking style that speakers adopt to be understood correctly in a difficult communication environment. Studies have shown that CLR speech, as opposed to "conversational" (CNV) speech, has significantly higher intelligibility in various conditions. While many differences in acoustic features have been identified, it is not known which individual feature or combinations of features cause the higher intelligibility of CLR speech. The objectives of the current study are to examine whether it is possible to improve speech intelligibility by approximating CLR speech features and to determine which acoustic features contribute to intelligibility. Our approach creates speech samples that combine acoustic features of CNV and CLR speech, using a hybridization algorithm. Results with normal-hearing listeners showed significant sentence-level intelligibility improvements of 11-23% over CNV speech when replacing certain acoustic features with those from CLR speech.

Neighborhood density and neighborhood frequency effects in French spoken word recognition
Sophie Dufour, Laboratoire de psycholinguistique expérimentale, Geneva University, Switzerland
Ulrich Hans Frauenfelder, Laboratoire de psycholinguistique expérimentale, Geneva University, Switzerland

According to activation-based models of spoken word recognition, words with many and high frequency neighbors are processed more slowly than words with few and low frequency neighbors. Because empirical support for inhibitory neighborhood effects comes mainly from studies conducted in English, the effects of neighborhood density and neighborhood frequency were examined in French. As typically observed in English, we found that words residing in dense neighborhoods are recognized more slowly than words residing in sparse neighborhoods. Moreover, we showed that words with higher frequency neighbors are processed more slowly than words with no higher frequency neighbors. Implications of these results for spoken word recognition are discussed.

Discrimination and Recognition of Scaled Word Sounds
Toshio Irino, Faculty of Systems Engineering, Wakayama University, Japan
Yoshie Aoki, Faculty of Systems Engineering, Wakayama University, Japan
Yoshie Hayashi, Faculty of Systems Engineering, Wakayama University, Japan
Hideki Kawahara, Faculty of Systems Engineering, Wakayama University, Japan
Roy Patterson, CNBH, Dept. of Physiology, Development, and Neuroscience, Cambridge University, UK

Smith et al. (2005) and Ives et al. (2005) demonstrated that humans could extract information about the size of a speaker's vocal tract from speech sounds (vowels and syllables, respectively). We have extended their discrimination and recognition experiments to naturally pronounced words. The Just Noticeable Difference (JND) for size discrimination was between 5.5% and 19% depending on the listener. The smallest JND is comparable to that of the syllable experiments; the average JND is comparable to that of the vowel experiments. The word recognition scores remain above 50% for speaker sizes beyond the normal range for humans. The fact that good performance extends over such a large range of acoustic scales supports Irino and Patterson’s hypothesis (2002) that the auditory system segregates size and shape information at an early stage in the processing.

Benchmarking Human Performance on the Acoustic and Linguistic Subtasks of ASR Systems
Laszlo Toth, Research Group on Artificial Intelligence

Many believe that comparisons of machine and human speech recognition could help determine both the room for and the direction of improvement for speech recognizers. Yet such experiments are rarely conducted, or are conducted over domains so complex that instructive conclusions are hard to draw. In this paper we attempt to measure human performance on the tasks of the acoustic and language models of ASR systems separately. To simulate the task of acoustic decoding, subjects were instructed to phonetically transcribe short nonsense sentences. Here, besides the well-known superior segment classification, we also observed good performance in word segmentation. To imitate higher-level processing, the subjects had to correct deliberately corrupted texts. Here we found that humans can achieve a word accuracy of about 80% even when almost one third of the phonemes are incorrect, and that with word boundary position information the word error rate roughly halves.
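
The abstract above reports results as word accuracy and word error rate but does not say how they were scored. As a minimal sketch, assuming the conventional definition (word-level edit distance divided by the number of reference words, with word accuracy taken as one minus that rate), the two measures can be computed as follows. The function names and the example sentences are hypothetical and are not taken from the paper.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, deletions and insertions
    needed to turn the reference word sequence into the hypothesis."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word errors divided by the number of reference words (assumed scoring)."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference must contain at least one word")
    return word_errors(reference, hypothesis) / n_ref


def word_accuracy(reference, hypothesis):
    """One minus the word error rate (assumed definition)."""
    return 1.0 - word_error_rate(reference, hypothesis)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(word_error_rate(ref, hyp), word_accuracy(ref, hyp))
```

Under this definition insertions count as errors, so the rate can exceed 1 and the accuracy can become negative when the hypothesis is much longer than the reference.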

Contributions of Temporal Fine Structure Cues to Chinese Speech Recognition in Cochlear Implant Simulation
Lin Yang, ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences
Jianping Zhang, ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences
Yonghong Yan, ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences

This study evaluated the relative contributions of temporal fine structure cues in different frequency bands to Mandarin speech recognition both in quiet and in noise. Chinese tone, vowel, consonant and sentence recognition scores were measured in a 4-channel continuous interleaved sampling (CIS) simulation model with six kinds of carriers: all noise carriers (N1234), all fine structure carriers (F1234), and a fine structure carrier in one channel with noise carriers in the others (F1N234, F2N134, F3N124, F4N123). Results showed that low-frequency fine structure below 400 Hz contributed significantly to tone recognition, while mid-frequency fine structure from 400 to 1000 Hz contributed most to vowel and consonant recognition in quiet. In severe noise, however, it was the combined contribution of the temporal fine structure in all bands that significantly improved vowel and consonant recognition. For sentence recognition, tones contributed more than vowels and consonants.
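
The six carrier conditions named in the abstract above follow a compact labelling scheme: "F" or "N" followed by the channel numbers that use a fine structure or a noise carrier. As a small illustration of that scheme only (it does not implement the CIS simulation itself), the sketch below expands a label into a per-channel assignment. The function name `parse_condition` and the constant `CONDITIONS` are hypothetical.

```python
# The six carrier conditions listed in the abstract.
CONDITIONS = ["N1234", "F1234", "F1N234", "F2N134", "F3N124", "F4N123"]


def parse_condition(label, n_channels=4):
    """Expand a label such as 'F2N134' into {channel: carrier type}."""
    carriers = {}
    kind = None
    for ch in label:
        if ch in "FN":
            # A letter sets the carrier type for the digits that follow it.
            kind = "fine structure" if ch == "F" else "noise"
        elif ch.isdigit() and kind is not None:
            carriers[int(ch)] = kind
        else:
            raise ValueError(f"unexpected character {ch!r} in {label!r}")
    # Every channel must be assigned exactly one carrier type.
    if sorted(carriers) != list(range(1, n_channels + 1)):
        raise ValueError(f"{label!r} does not cover channels 1..{n_channels}")
    return carriers


if __name__ == "__main__":
    for label in CONDITIONS:
        print(label, parse_condition(label))
```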

Effect of number of masking talkers on speech-on-speech masking in Chinese
Xihong Wu, State Key Laboratory on Machine Perception, Speech and Hearing Research Center, Peking University
Jing Chen, PhD candidate of State Key Laboratory on Machine Perception, Speech and Hearing Research Center, Peking University
Zhigang Yang, PhD candidate of Department of Psychology, Peking University
Qiang Huang, Master candidate PhD candidate of State Key Laboratory on Machine Perception, Speech and Hearing Research Center, Peking University
Mengyuan Wang, PhD candidate of Department of Psychology, Peking University
Liang Li, Professor of Department of Psychology, Peking University

In this study, targets were nonsense sentences spoken by a Chinese female, and maskers were nonsense sentences spoken by 1, 2, 3, or 4 other Chinese females. All stimuli were presented by two spatially separated loudspeakers. Using the precedence effect, manipulation of the delay between the two loudspeakers for the masker determined whether the target and masker were perceived as coming from the same or different locations. The results show that the masking effect increased markedly as the number of masking talkers increased progressively from 1 to 4, which is also confirmed by the calculation of the speech intelligibility index. However, the perceived spatial separation, which predominantly reduced informational masking, caused the largest improvement in speech identification with the two-talker masker, indicating that two-voice speech had the highest informational masking impact. Some differences between Chinese speech masking and English speech masking are discussed.

Do different boundary types induce subtle acoustic cues to which French listeners are sensitive?
Odile Bagou, FPSE, Lab. de Psycholinguistique Expérimentale, Université de Genève
Sophie Dufour, FPSE, Lab. de Psycholinguistique Expérimentale, Université de Genève
Cécile Fougeron, Lab. de Phonétique et Phonologie (UMR 7018) CNRS/Sorbonne Nouvelle, Paris3
Alain Content, Laboratoire de Psychologie Expérimentale, U. Libre de Bruxelles
Ulrich H. Frauenfelder, FPSE, Lab. de Psycholinguistique Expérimentale, Université de Genève

This paper examines the production and perception of three types of phonological boundaries. In the first part, we extended our previous acoustic analysis to confirm that French speakers mark word and syllable boundaries differently in enchaînement sequences. The durational properties of vowels and consonants were compared in 3 boundary conditions: (A) enchaînement (V1C#V2), (B) word-initial consonant (V1#CV2), (C) syllable onset consonant (V1.CV2). Results showed that the three boundary conditions differ in subtle durational properties of V1 and C. In the second part, the sensitivity of French listeners to these acoustic cues was evaluated. Preliminary results showed that participants are sensitive to durational differences, at least for discriminating between syllable and word boundaries. Implications of these results for lexical segmentation are discussed.

An Information Theoretic Approach to Predict Speech Intelligibility for Listeners with Normal and Impaired Hearing
Svante Stadler, Sound and Image Processing Lab, Royal Institute of Technology, Stockholm, Sweden
Arne Leijon, Sound and Image Processing Lab, Royal Institute of Technology, Stockholm, Sweden
Björn Hagerman, Unit of Technical and Clinical Audiology, Karolinska Institutet, Danderyd, Sweden

A computational method to predict speech intelligibility in noisy environments has been developed. By modeling speech and noise as stochastic signals, the information transmission through a given auditory model can be estimated. Rate-distortion theory is then applied to predict speech recognition performance. Results are compared with subjective tests on normal and hearing impaired listeners. It is found that the method underestimates the supra-threshold deficits of hearing impairment, which is believed to be due to an overly simple auditory model and a small dictionary size.

Speaking rate effects in a landmark-based phonetic exemplar model
Travis Wade, Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Germany
Bernd Möbius, Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Germany

In this study we describe a model of speech perception in which neither speaking rate nor lower level temporal cues are considered explicitly. Instead, newly encountered speech signals are encoded as sequences of detailed acoustic events specified in real time at salient landmarks and compared directly with previously heard patterns. When presented with obstruent-vowel sequences occurring in the TIMIT database, the model performs similarly to humans in two respects: it relies on temporal information for consonant and vowel recognition (and interprets this information in a rate-dependent manner) when non-temporal cues are ambiguous, and it is adversely affected by local rate variability. These results indicate that compensation for speaking rate in human perception may follow implicitly from even modest knowledge of the robust correlations between temporal and other properties of individual speech events and those of their surrounding contexts, and does not require special normalization processes.

Acoustic correlates of intelligibility enhancements in clearly produced fricatives
Kazumi Maniwa, Department of Linguistics, University of Konstanz, Germany
Allard Jongman, Department of Linguistics, University of Kansas, U.S.A.
Travis Wade, Institute for Natural Language Processing, University of Stuttgart, Germany

Two experiments investigated whether and how clear speech production enhances intelligibility of English fricatives for normal-hearing listeners and listeners with simulated hearing impairment. Babble thresholds were measured for minimal pair distinctions. Clear speech benefited both groups overall; however, for impaired listeners, the clear speech effect held only for sibilant pairs. Correlation analyses comparing acoustic and perceptual data indicated that a shift of energy concentration toward higher frequency regions and greater source strength contributed to the clear speech effect for normal-hearing listeners, while listeners with simulated loss seemed to benefit mostly from cues involving lower frequency regions.

Modelling the Human-machine Gap in Speech Reception: Microscopic Speech Intelligibility Prediction for Normal-hearing Subjects with an Auditory Model
Tim Jürgens, Institute of Physics, Carl-von-Ossietzky University Oldenburg, Germany
Thomas Brand, Institute of Physics, Carl-von-Ossietzky University Oldenburg, Germany
Birger Kollmeier, Institute of Physics, Carl-von-Ossietzky University Oldenburg, Germany

Speech intelligibility in noise for normal-hearing subjects is predicted by a model that consists of an auditory preprocessing stage and a speech recognizer. Using a highly systematic logatome speech corpus allows the analysis of response rates and confusions of single phonemes. The predicted data are validated by listening tests. If test utterances that are not identical to those in the training material are used, the psychometric function in noise is predicted with an offset of +12 dB SNR. This is consistent with the man-machine performance gap when comparing human with automatic speech recognition. This offset reduces to nearly 0 dB in a second model design where identical recordings for training and testing are used. This underlines the “optimal detector” concept required to model human speech perception, assuming that the “world knowledge” yields an optimal template in each listening experiment. Furthermore, predicted confusion matrices are compared to those of normal-hearing subjects.

Lombard Speech Impact on Perceptual Speaker Recognition
Ayako Ikeno, Center for Robust Speech Systems (CRSS), University of Texas at Dallas
John H.L. Hansen, Center for Robust Speech Systems (CRSS), University of Texas at Dallas

The goal of this study is to investigate how the Lombard effect impacts perceptual speaker recognition. We report results from In-Set/Out-of-Set speaker identification (ID) tasks performed by human subjects with a comparison to automatic algorithms. The main trends show that mismatch in reference and test data causes a significant decrease in speaker ID accuracy. The results also indicate that Lombard speech contributes to higher accuracy for In-Set speaker ID, but interferes with correct detection of Out-of-Set speakers. In addition, it is observed that the mismatched conditions cause a higher false reject rate, and that the matched conditions result in higher false acceptance. We further discuss automated system performance in comparison to human performance. Overall observations suggest that deeper understanding of cognitive factors involved in perceptual speaker ID offers meaningful insights for further development of automatic systems and combined automatic-human based systems.

Effect of Within- and Between-talker Variability on Word Identification in Noise by Younger and Older Adults
Huiwen Goy, Department of Psychology, University of Toronto, Mississauga, Canada
Kathy Pichora-Fuller, Department of Psychology, University of Toronto, Mississauga, Canada; Toronto Rehabilitation Institute, Toronto, Canada
Pascal van Lieshout, Department of Psychology, University of Toronto, Mississauga, Canada; Department of Speech Pathology, University of Toronto, Toronto, Canada; Toronto Rehabilitation Institute, Toronto, Canada; IBBME, University of Toronto, Toronto, Canada
Gurjit Singh, Department of Psychology, University of Toronto, Mississauga, Canada; Toronto Rehabilitation Institute, Toronto, Canada
Bruce Schneider, Department of Psychology, University of Toronto, Mississauga, Canada

Talkers alter their speech in noisy environments, yet most speech-in-noise testing uses materials recorded in quiet. Sentences from a common test (SPIN-R) were recorded by a new talker in different talking conditions, and the original and new materials were used to test word identification accuracy in younger and older adults. Inter- and intra-talker differences affected performance. Intelligibility was better for materials heard in noise when the materials were spoken in noise, or when the talker was asked to speak loudly, especially for older listeners. The most likely acoustical explanation for the inter-talker difference seems to be increased intensity and duration in the production of the sentence-final target words.

Speech Perception in Children with Speech Sound Disorder
H. Timothy Bunnell, Nemours Biomedical Research
N. Carolyn Schanen, Nemours Biomedical Research
Linda Vallino, Nemours Biomedical Research
Thierry Morlet, Nemours Biomedical Research
James Polikoff, Nemours Biomedical Research
Jennette Driscoll, Nemours Biomedical Research
James Mantell, Nemours Biomedical Research

This paper describes preliminary results from an ongoing study designed to characterize acoustic-phonetic phenotypes in children with speech delay (SD) of unknown origin. Here we present data on 13 SD children and their siblings (26 children in all) from two speech perception tasks: a two-alternative forced choice categorical perception (ID) task, and an error monitoring (EM) task. In the ID task, minimally differing words (e.g., goat – coat) were used to create 9-step synthetic continua. For the EM task, children heard both correctly and incorrectly articulated words and indicated whether the word was correct or not. Some word tokens in this task were produced by the SD children identified as probands in this study. On both tasks, SD children performed more poorly than their non-SD siblings, showing more gradual slopes in their ID functions, and less accuracy in identifying correct versus error productions.

Speech coding and information processing by auditory neurons
Huan Wang, Infineon Technologies, Munich and Technical University, Munich
Werner Hemmert, Bernstein Center for Computational Neuroscience, Munich

One fundamental difference between information processing in the auditory pathway and automatic speech recognition (ASR) systems lies in the coding and processing of nerve-action potentials. Spike trains code amplitude information by means of a rate code, but most information is carried by precise spike timing. In this paper we focus on neurons located in the ventral cochlear nucleus (VCN), which get direct input from primary auditory nerve fibers (ANF). We generate spike trains of the ANFs and VCN neurons with our inner ear model and calculate the transmitted information using a vowel as input stimulus. The spectral information of sound signals is well reflected in the rate-place code of ANFs and VCN neurons; however, the major part of the information (about 90%) is carried by spike timing. We conclude that we should not neglect this fine-grained temporal information for automatic speech recognition.

What do listeners attend to in hearing prosodic structures? Investigating the human speech-parser using short-term recall
Annie C. Gilbert, Laboratoire de Sciences Phonétiques, Université de Montréal
Victor J. Boucher, Laboratoire de Sciences Phonétiques, Université de Montréal

This study examines how heard prosodic patterns are parsed by reference to a principle of focus of attention. According to this principle, attention holds up to four items at once, and the same upper limit appears to apply to the number of syllables in rhythm groups. On this basis it was predicted that in recalling heard prosodic structures, listeners would attend primarily to rhythm groups. Thirty-one subjects were asked to recall the prosody of heard series of [pa] bearing various intonation groups and repetitive or varying rhythms. Exp. 1 showed how the focus of attention can shift when rhythm patterns are repetitive. However, Exp. 2 showed that listeners focus on rhythm when patterns vary (as in speech). The results have implications for explaining the role of prosodic groups in speech.
