Interspeech 2007 Session TuD.SS: Speech and audio processing for intelligent environments
Tuesday, August 28, 2007
16:00 – 18:00
Astrid Scala 1
Reinhold Haeb-Umbach (University of Paderborn), Zheng-Hua Tan (Aalborg University, Denmark)
Ambient telephony: scenarios and research challenges
Aki Härmä, Philips Research
Telecommunications at home is changing rapidly. Many people have moved from the traditional PSTN phone to the mobile phone, and for a growing number of people Voice-over-IP telephony on a PC platform is now becoming the primary technology for voice communication. In this tutorial paper we give an overview of some of the current trends and try to characterize the next generation of home telephony, in particular the concept of ambient telephony. We give an overview of the research challenges in the development of ambient telephone systems and introduce some potential solutions and scenarios.
Always Listening to You: Creating Exhaustive Audio Database in Home Environments
Yasunari Obuchi, Central Research Laboratory, Hitachi Ltd.
Akio Amano, Central Research Laboratory, Hitachi Ltd.
We describe a novel audio database recorded in home environments. The database contains continuous sounds from morning to evening, regardless of what the subject is doing, and also includes utterances intended to invoke speech recognition. It tells us how often the speech interface is used, how often it is activated erroneously, and how people speak when they really want to use speech recognition. The database also features array recordings to support improving the performance of speech/non-speech detection and speech recognition under noisy conditions. Preliminary experiments show that the speech/non-speech detection performance of the trigger-initiated activation system is relatively high, but that of the automatic activation system is not satisfactory. Adopting array-based and F0-based detection algorithms produces a slight rise of the precision/recall curve, but more research is necessary to realize a life with ubiquitous speech interfaces, in which machines are always listening to you.
Joint Speaker Segmentation, Localization and Identification for Streaming Audio
Joerg Schmalenstroeer, Department of Communications Engineering, University of Paderborn, Germany
Reinhold Haeb-Umbach, Department of Communications Engineering, University of Paderborn, Germany
In this paper we investigate the problem of identifying and localizing speakers with distant microphone arrays, thus extending the classical speaker diarization task to answer the question "who spoke when and where". We consider a streaming audio scenario, where the diarization output is to be generated in real time with as low a latency as possible. Rather than carrying out the individual segmentation and classification tasks (speech detection, change detection, gender/speaker classification) sequentially, we propose simultaneous segmentation and classification using a Viterbi decoder. It uses a transition matrix estimated online from position information and speaker change hypotheses, instead of fixed transition probabilities. This avoids early hard decisions and is shown to outperform the sequential approach.
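The decoding idea can be sketched as a Viterbi pass whose transition matrix is allowed to change from frame to frame, as when it is re-estimated online from position information and speaker-change hypotheses. The following minimal example is illustrative only (the states, observation likelihoods, and uniform transition matrix are invented, not taken from the paper):

```python
import numpy as np

def viterbi(log_obs, log_trans, log_prior):
    """Viterbi decoding with a time-varying transition matrix.
    log_obs: (T, S) per-frame log-likelihoods for S states,
    log_trans: (T-1, S, S) per-step log transition matrices,
    log_prior: (S,) initial log probabilities.  Returns the best state path."""
    T, S = log_obs.shape
    delta = log_prior + log_obs[0]           # best score ending in each state
    back = np.zeros((T, S), dtype=int)       # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans[t - 1]       # (S, S) candidate scores
        back[t] = np.argmax(scores, axis=0)              # best predecessor per state
        delta = scores[back[t], np.arange(S)] + log_obs[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):            # trace the backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy run: 2 states; observations favor state 0 first, then state 1.
log_obs = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]))
log_trans = np.log(np.full((3, 2, 2), 0.5))  # uniform here; online estimates would go here
best_path = viterbi(log_obs, log_trans, np.log(np.array([0.5, 0.5])))
# → [0, 0, 1, 1]
```

Because the whole path is scored jointly, no hard speech/change/speaker decision is taken before the evidence from later frames is seen.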
Active binaural distance estimation for dynamic sources
Yan-Chen Lu, Department of Computer Science, University of Sheffield, Sheffield, UK
Martin Cooke, Department of Computer Science, University of Sheffield, Sheffield, UK
Heidi Christensen, Department of Computer Science, University of Sheffield, Sheffield, UK
A method for estimating sound source distance in dynamic auditory ‘scenes’ using binaural data is presented. The technique requires little prior knowledge of the acoustic environment. It consists of feature extraction for two dynamic distance cues, motion parallax and acoustic τ, coupled with an inference framework for distance estimation. Sequential and non-sequential models are evaluated using simulated anechoic and reverberant spaces. Sequential approaches based on particle filtering more than halve the distance estimation error in all conditions relative to the non-sequential models. These results confirm the value of active behaviour and probabilistic reasoning in auditorily-inspired models of distance perception.
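The sequential inference component can be illustrated with a one-dimensional bootstrap particle filter tracking source distance from a noisy cue. The motion and observation models below are invented for the example and are not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter(observations, n_particles=1000,
                    motion_std=0.05, obs_std=0.3):
    """Bootstrap particle filter over source distance (metres)."""
    particles = rng.uniform(0.5, 5.0, n_particles)   # initial distance hypotheses
    estimates = []
    for z in observations:
        particles += rng.normal(0.0, motion_std, n_particles)   # predict (random walk)
        w = np.exp(-0.5 * ((z - particles) / obs_std) ** 2)     # weight by Gaussian likelihood
        w /= w.sum()
        idx = rng.choice(n_particles, n_particles, p=w)         # resample
        particles = particles[idx]
        estimates.append(particles.mean())                      # posterior-mean estimate
    return estimates

# Simulated source receding from 1 m to 2 m; observations are noisy distance cues.
true_dist = np.linspace(1.0, 2.0, 20)
obs = true_dist + rng.normal(0.0, 0.3, 20)
est = particle_filter(obs)
err = abs(est[-1] - true_dist[-1])
```

Averaging evidence over time is what lets the sequential model beat a per-frame (non-sequential) estimate built from the same noisy cues.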
A Packetization and Variable Bitrate Interframe Compression Scheme For Vector Quantizer-Based Distributed Speech Recognition
Bengt J. Borgstrom, University of California, Los Angeles
Abeer Alwan, University of California, Los Angeles
We propose a novel packetization and variable bitrate compression scheme for DSR source coding, based on the Group of Pictures concept from video coding. The proposed algorithm simultaneously packetizes and further compresses source coded features using the high interframe correlation of speech, and is compatible with a variety of VQ-based DSR source coders. The algorithm approximates vector quantizers as Markov Chains, and empirically trains the corresponding probability parameters. Feature frames are then compressed as I-frames, P-frames, or B-frames, using Huffman tables. The proposed scheme can perform lossless compression, but is also robust to lossy compression through VQ pruning or frame puncturing. To illustrate its effectiveness, we applied the proposed algorithm to the ETSI DSR source coder. The algorithm provided compression rates of up to 31.60% with negligible recognition accuracy degradation, and rates of up to 71.15% with performance degradation under 1.0%.
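The core interframe idea can be sketched as follows: treat the VQ index stream as a Markov chain, estimate transition counts from training data, and Huffman-code each index conditioned on its predecessor (analogous to a P-frame), so likely transitions cost few bits. This is a hedged illustration of the principle, not the ETSI coder or the authors' exact scheme:

```python
import heapq
import itertools
from collections import Counter

def huffman_code(freqs):
    """Build a prefix code (symbol -> bitstring) from a frequency dict."""
    tick = itertools.count()                      # unique tie-breaker for the heap
    heap = [(f, next(tick), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {s: "0" for s in freqs}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tick), merged))
    return heap[0][2]

def train_tables(index_stream, n_codewords):
    """One Huffman table per predecessor index (add-one smoothing)."""
    tables = {}
    for prev in range(n_codewords):
        counts = Counter({j: 1 for j in range(n_codewords)})
        counts.update(b for a, b in zip(index_stream, index_stream[1:])
                      if a == prev)
        tables[prev] = huffman_code(counts)
    return tables

# Toy stream over a 4-codeword quantizer with strong interframe correlation.
stream = [0, 0, 0, 1, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2, 2, 3]
tables = train_tables(stream, 4)
bits = "".join(tables[a][b] for a, b in zip(stream, stream[1:]))
raw_bits = 2 * (len(stream) - 1)   # 2 bits per index if sent uncoded
```

Because frames tend to repeat or change slowly, the conditional entropy of the index stream is well below its raw bit width, which is exactly what the compression exploits.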
Channel Selection by Class Separability Measures for Automatic Transcriptions on Distant Microphones
Matthias Wölfel, Universität Karlsruhe (TH)
Channel selection is important for automatic speech recognition because the signal quality of one channel may be significantly better than that of the other channels, so that microphone array or blind source separation techniques might not lead to improvements over the best single microphone. The major challenge, however, is to find the particular channel that leads to the most accurate classification. In this paper we present a novel channel selection method, based on class separability, to improve multi-source far-field speech-to-text transcription. Class separability measures have the advantage, compared to other methods such as the signal-to-noise ratio (SNR), that they evaluate channel quality on the actual features of the recognition system. We evaluated the method on NIST's RT-07 development set and observed significant improvements in word accuracy over SNR-based channel selection methods. We have also used this technique in NIST's RT-07 evaluation.
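One common class-separability measure is a Fisher-style scatter ratio. The sketch below (data, labels, and the specific ratio are invented for illustration; the paper's measure may differ) scores each channel by how well its features separate known classes, then picks the best channel rather than the highest-SNR one:

```python
import numpy as np

def fisher_ratio(features, labels):
    """Between-class over within-class scatter (trace form); higher is better."""
    classes = np.unique(labels)
    mean = features.mean(axis=0)
    sb = sw = 0.0
    for c in classes:
        fc = features[labels == c]
        sb += len(fc) * np.sum((fc.mean(axis=0) - mean) ** 2)  # between-class
        sw += np.sum((fc - fc.mean(axis=0)) ** 2)              # within-class
    return sb / sw

def select_channel(channel_features, labels):
    """Return the index of the channel with the best class separability."""
    scores = [fisher_ratio(f, labels) for f in channel_features]
    return int(np.argmax(scores)), scores

# Toy data: channel 1 separates the two classes far better than channel 0.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 50)
noisy = np.concatenate([rng.normal(0.0, 1, (50, 2)), rng.normal(0.2, 1, (50, 2))])
clean = np.concatenate([rng.normal(0.0, 1, (50, 2)), rng.normal(3.0, 1, (50, 2))])
best, scores = select_channel([noisy, clean], labels)
```

Unlike SNR, the score is computed directly in the recogniser's feature space, so it reflects the quantity that actually matters for classification.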
Conversation Detection and Speaker Segmentation in Privacy-Sensitive Situated Speech Data
Danny Wyatt, University of Washington
Tanzeem Choudhury, Intel Research
Jeff Bilmes, University of Washington
We present a privacy-sensitive approach for automatically finding multi-person conversations in spontaneous, situated speech data and segmenting those conversations into speaker turns. By recording and using features that are rich enough to capture conversational styles and dynamics but not sufficient for reconstructing intelligible speech, we are able to ensure a certain level of privacy for the users. We present empirical validation of our approach on truly situated spontaneous speech. Experimental results show that the conversation finding method presented in this paper outperforms earlier approaches, and that the speaker segmentation accuracy is a significant improvement over the only other known privacy-sensitive method for speaker segmentation.
Audio-based approaches to head orientation estimation in a smart-room
Alberto Abad, TALP Research Center, Universitat Politècnica de Catalunya (UPC)
Carlos Segura, TALP Research Center, Universitat Politècnica de Catalunya (UPC)
Climent Nadeu, TALP Research Center, Universitat Politècnica de Catalunya (UPC)
Javier Hernando, TALP Research Center, Universitat Politècnica de Catalunya (UPC)
The head orientation of human speakers in a smart-room affects the quality of the signals recorded by far-field microphones, and consequently influences the performance of the technologies deployed on the basis of those signals. Additionally, knowing the orientation in these environments can be useful for the development of several advanced multimodal services, for instance in microphone network management. Consequently, head orientation estimation has recently become a research topic of growing interest. In this paper, we propose two different approaches to head orientation estimation on the basis of multi-microphone recordings: first, an approach based on a generalization of the well-known SRP-PHAT speaker localization algorithm, and second, a new approach based on measurements of the ratio between the high- and low-band speech energies. Promising results are obtained in both cases, with generally better performance from the algorithms based on speaker localization methods.
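The band-energy-ratio cue rests on the fact that speech radiated off-axis loses more high-frequency than low-frequency energy, so the microphone the speaker is facing shows the largest high-to-low band energy ratio. A minimal sketch, with the band split, signals, and attenuation values invented for illustration:

```python
import numpy as np

def high_low_band_ratio(frame, fs, split_hz=4000):
    """Ratio of high-band to low-band energy in one signal frame."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    high = spec[freqs >= split_hz].sum()
    low = spec[freqs < split_hz].sum()
    return high / low

fs = 16000
t = np.arange(1024) / fs
# Two simulated microphones: the "facing" mic keeps more of the 6 kHz energy,
# while the low-frequency component is radiated nearly omnidirectionally.
facing = np.sin(2 * np.pi * 500 * t) + 0.8 * np.sin(2 * np.pi * 6000 * t)
behind = np.sin(2 * np.pi * 500 * t) + 0.2 * np.sin(2 * np.pi * 6000 * t)
ratios = [high_low_band_ratio(x, fs) for x in (facing, behind)]
facing_mic = int(np.argmax(ratios))   # → 0
```

Comparing this ratio across the microphone network gives a coarse orientation estimate without needing any localization step.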
Multi-Resolution Soft Features for Channel-Robust Distributed Speech Recognition
Valentin Ion, University of Paderborn, Dept. of Communications Engineering
Reinhold Haeb-Umbach, University of Paderborn, Dept. of Communications Engineering
In this paper we introduce soft features of variable resolution for robust distributed speech recognition over channels exhibiting packet losses. The underlying rationale is that lost feature vectors can never be reconstructed perfectly, and therefore reconstruction is carried out at a lower resolution than that of the originally sent features. By doing so, enormous reductions in computational effort can be achieved with graceful or even no degradation in word accuracy. In experiments conducted on the Aurora II database we obtained, for example, a 30-fold reduction in computation time for the reconstruction of the soft features without any effect on the word error rate. The proposed method is fully compatible with the ETSI DSR standard, as no changes to the front-end processing or the transmission format are involved.
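The variable-resolution idea can be illustrated with a toy reconstruction: a lost feature value is represented as a discrete posterior over quantization levels, and using far fewer levels makes every downstream use of the soft feature proportionally cheaper while barely moving the reconstructed mean. The interpolation prior and Gaussian spread below are invented for the example, not the paper's model:

```python
import numpy as np

def soft_reconstruct(prev_val, next_val, n_levels, spread=1.0, lo=-5.0, hi=5.0):
    """Posterior over a coarse grid for a lost feature value.
    Returns (grid, posterior, posterior mean)."""
    levels = np.linspace(lo, hi, n_levels)
    center = 0.5 * (prev_val + next_val)           # simple interpolation prior
    post = np.exp(-0.5 * ((levels - center) / spread) ** 2)
    post /= post.sum()
    return levels, post, float(post @ levels)

# Reconstructing the same lost value at high and low resolution:
_, _, mean_hi = soft_reconstruct(0.8, 1.2, n_levels=64)
_, _, mean_lo = soft_reconstruct(0.8, 1.2, n_levels=8)
```

Computation over the 8-level posterior is an eighth of that over the 64-level one, yet both means sit essentially on the interpolated value, which is the spirit of the reported 30-fold speed-up at unchanged word error rate.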