Interspeech 2007 Session WeB.P1b: Multimodal/multimedia signal processing
Wednesday, August 29, 2007
10:00 – 12:00
Alexandros Potamianos (Technical University of Crete)
Audio-visual phoneme classification for pronunciation training applications
Hedvig Kjellström, Computational Vision and Active Perception Laboratory, CSC, KTH, Stockholm, Sweden
Olov Engwall, Centre for Speech Technology, CSC, KTH, Stockholm, Sweden
Sherif Abdou, Department of IT, Faculty of Computers and Information, Cairo University, Giza, Egypt
Olle Bälter, Human-Computer Interaction Group, CSC, KTH, Stockholm, Sweden
We present a method for audio-visual classification of Swedish phonemes, to be used in computer-assisted pronunciation training. The probabilistic kernel-based method is applied to the audio signal and/or a principal or independent component (PCA or ICA) representation of the mouth region in video images. We investigate which representation (PCA or ICA) is more suitable, and how many components are required in the basis, in order to automatically detect pronunciation errors in Swedish from audio-visual input. Experiments performed on one speaker show that the visual information helps avoid classification errors that would lead to gravely erroneous feedback to the user; that it is better to perform phoneme classification on audio and video separately and then fuse the results, rather than combining them before classification; and that PCA outperforms ICA for fewer than 50 components.
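The pipeline the abstract describes, a PCA basis over mouth-region images plus decision-level fusion of separate audio and video classifiers, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names, the SVD-based PCA, and the fusion weight are all assumptions for the sketch.

```python
import numpy as np

def pca_basis(images, n_components):
    """Compute a PCA basis from flattened mouth-region images.
    images: (N, H*W) array; returns (mean, components)."""
    mean = images.mean(axis=0)
    X = images - mean
    # SVD of the centred data: rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return mean, Vt[:n_components]

def project(image, mean, components):
    """Project one flattened mouth image onto the PCA basis."""
    return components @ (image - mean)

def late_fusion(p_audio, p_video, w_audio=0.7):
    """Decision-level fusion, as the abstract recommends: combine
    per-phoneme posteriors from separate audio and video classifiers.
    The weight 0.7 is purely illustrative."""
    p = w_audio * np.asarray(p_audio) + (1.0 - w_audio) * np.asarray(p_video)
    return p / p.sum()
```

An ICA basis would replace `pca_basis` while leaving the projection and fusion steps unchanged, which is what makes the PCA-versus-ICA comparison in the abstract a drop-in swap.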
Visual Information and Redundancy Conveyed by Internal Articulator Dynamics in Synthetic Audiovisual Speech
Katja Grauwinkel, Berlin University of Technology
Britta Dewitt, Berlin University of Technology
Sascha Fagel, Berlin University of Technology
We report the results of a study on the visual information conveyed by the dynamics of internal articulators. The intelligibility of synthetic audiovisual speech with and without visualization of internal articulator movements was compared. Recognition scores were contrasted before and after a learning lesson in which articulator trajectories were explained, once with and once without motion of the internal articulators. The results show that motion information from internal articulator dynamics did not at first lead to significantly different recognition scores, and that only with this additional visual information did the learning lesson significantly increase visual and audiovisual intelligibility. After the learning lesson with internal articulator movements, visual recognition was enhanced more than audiovisual recognition. This could be shown to be due to incomplete sensory integration and to redundant information conveyed by the auditory and visual sources.
A Speech Rate Related Lip Movement Model for Speech Animation
Wei Zhou, University of Science & Technology of China
Zengfu Wang, University of Science & Technology of China
A novel lip movement model related to speech rate is proposed in this paper. The model is constructed on the basis of research results on the viscoelasticity of skin-muscle tissue and the quantitative relationship between lip muscle force and speech rate. To show the validity of the model, we have applied it to our Chinese speech animation system. The experimental results show that our system can synthesize individualized speech animation with high naturalness at different speech rates.
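The abstract does not give the model's equations, but a standard way to capture skin-muscle viscoelasticity is a Kelvin-Voigt element (spring and damper in parallel) driven by a muscle-force signal; faster speech rates would correspond to shorter, larger force pulses. The sketch below is a generic illustration under that assumption, not the authors' model, and the parameter values are invented.

```python
import numpy as np

def lip_displacement(force, k=1.0, c=0.3, dt=0.01):
    """Integrate a Kelvin-Voigt element, c * dx/dt + k * x = F(t),
    with explicit Euler. force: array of muscle-force samples;
    returns the lip displacement x(t) at the same sample times."""
    x = np.zeros(len(force))
    for t in range(1, len(force)):
        # damper velocity follows the net of applied and spring force
        dx = (force[t - 1] - k * x[t - 1]) / c
        x[t] = x[t - 1] + dt * dx
    return x
```

Under a constant force F the displacement relaxes toward F/k with time constant c/k, which is the qualitative lag between muscle activation and visible lip motion that a rate-dependent animation model has to reproduce.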
An Extended 2DPCA-Based Visual Feature Extraction Method for Audio-Visual Speech Recognition
Guanyong Wu, Department of Electronics Engineering, Shanghai Jiaotong University, Shanghai, China. 200240
Jie Zhu, Department of Electronics Engineering, Shanghai Jiaotong University, Shanghai, China. 200240
Two-dimensional principal component analysis (2DPCA) has been proposed for face recognition as an alternative to the traditional PCA transform. In this paper, we extend this approach to visual feature extraction for audio-visual speech recognition (AVSR). First, a two-stage 2DPCA transform is conducted to extract the visual features. Then, visemic linear discriminant analysis (LDA) is applied for post-extraction processing. We compare the presented method with traditional PCA and 2DPCA. Experimental results show that the extended 2DPCA can reduce the dimension of 2DPCA and represents the test mouth images better than PCA does; moreover, 2DPCA+LDA needs less computation and performs better than PCA+LDA in visual-only speech recognition; finally, further experiments demonstrate that our AVSR system using the extended 2DPCA method is significantly more robust in noisy environments than audio-only speech recognition.
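Unlike PCA, 2DPCA works on the 2-D mouth image directly: it diagonalizes the image covariance matrix G = mean_i (A_i - Abar)^T (A_i - Abar) and projects each image onto the top eigenvectors. A two-stage variant additionally projects along the other image axis, shrinking the feature matrix in both dimensions. The sketch below illustrates that idea under stated assumptions; the function names are hypothetical and this is not the paper's exact formulation.

```python
import numpy as np

def twod_pca(images, d):
    """2DPCA: eigenvectors of the image covariance matrix
    G = mean_i (A_i - Abar)^T (A_i - Abar), computed on 2-D
    mouth images of shape (H, W) without flattening.
    Returns the top-d eigenvectors as a (W, d) projection matrix."""
    Abar = images.mean(axis=0)
    G = sum((A - Abar).T @ (A - Abar) for A in images) / len(images)
    vals, vecs = np.linalg.eigh(G)      # eigenvalues in ascending order
    return vecs[:, ::-1][:, :d]         # top-d columns

def two_stage_features(A, X_right, X_left):
    """Two-stage 2DPCA sketch: project along columns and rows,
    Z = X_left^T A X_right, giving a small (p, d) feature matrix
    from an (H, W) image."""
    return X_left.T @ A @ X_right
```

`X_left` is obtained by running `twod_pca` on the transposed images, so both image axes are compressed; the flattened `Z` would then feed the visemic LDA stage.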
Preventing an External Acoustic Noise from being Misrecognized as a Speech Recognition Object by Confirming the Lip Movement Image Signal
Soo-jong Lee, ETRI
This paper describes an attempt to prevent an external acoustic noise from being misrecognized as a speech recognition object by confirming the lip movement image signal of the speaker, in addition to analyzing the acoustic energy, in the speech activity detection procedure. A PC camera is added to the existing speech recognition environment, and the captured images are analyzed to detect the movement of the lips and to classify whether the sound is acoustic speech made by a human or not. We combined a speech recognition processor with an image recognizer, and the interworking function operated successfully at a rate of 99.3%. When a subject faced the camera while speaking, processing progressed normally to the output of the speech recognition result. When the subject did not face the camera, however, no recognition result was produced, since the acoustic energy is regarded as noise if no lip movement is confirmed.
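The gating logic described, accept audio as speech only when acoustic energy and confirmed lip movement coincide, reduces to a per-frame conjunction of two detectors. The sketch below is an illustrative reduction with invented thresholds, not the paper's system.

```python
import numpy as np

def lip_gated_vad(frame_energy, lip_motion, e_thresh=0.5, m_thresh=0.2):
    """Speech activity detection gated by lip movement: a frame is
    accepted as speech only if its acoustic energy exceeds the energy
    threshold AND lip motion is confirmed; otherwise the audio is
    treated as external noise and rejected. Thresholds are illustrative.
    Returns a boolean mask over frames."""
    is_speech = (np.asarray(frame_energy) > e_thresh) & \
                (np.asarray(lip_motion) > m_thresh)
    return is_speech
```

Only frames passing this mask would be forwarded to the recognizer, which is why loud noise without a visible speaker produces no recognition result.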
Automatic Head Motion Prediction from Speech Data
Gregor Hofer, University of Edinburgh
Hiroshi Shimodaira, University of Edinburgh
In this paper we present a novel approach to generating a sequence of head motion units from speech. The modelling approach is based on the notion that head motion can be divided into a number of short homogeneous units that can each be modelled individually. The system is based on hidden Markov models (HMMs), which are trained on motion units and act as a sequence generator; they can be evaluated by an accuracy measure. A database of motion capture data was collected, manually annotated for head motion, and used to train the models. It was found that the model is good at distinguishing high-activity regions from regions with less activity, with accuracies around 75 percent. Furthermore, the model is able to distinguish different head motion patterns from speech features somewhat reliably, with accuracies reaching almost 70 percent.
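Decoding a sequence of motion units from per-frame speech features is, at its core, Viterbi decoding over the unit models: each state is one motion-unit model, transitions encode how units follow each other, and emissions are each model's likelihood for the current frame. The sketch below shows that generic decoding step; it is an assumption-level illustration, not the authors' trained system.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most likely motion-unit sequence given per-frame log-likelihoods.
    log_emit: (T, S) frame log-likelihoods under each of S unit models,
    log_trans: (S, S) log transition matrix, log_init: (S,) log priors.
    Returns the decoded state index per frame."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # trace back best predecessors
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With two states standing in for "high activity" and "low activity", sticky transitions smooth the frame-wise decisions into the contiguous regions the abstract reports classifying.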
Omnidirectional Audio-Visual Talker Localizer With Dynamic Feature Fusion Based on Validity and Reliability Criteria
Yuki Denda, Ritsumeikan University
Takanobu Nishiura, Ritsumeikan University
Yoichi Yamashita, Ritsumeikan University
Talker localization is indispensable in video conferencing. Statistical audio-visual (AV) talker localizers that fuse AV features based on prior statistical properties would be ideal; however, these statistical properties must be estimated before the AV feature fusion procedure. To overcome this problem, this paper proposes a novel, robust, omnidirectional AV talker localizer that dynamically fuses AV features based on validity and reliability criteria, eliminating the need for prior statistical properties. AV features are extracted from the captured AV signals by estimating the direction of arriving speech with an equilateral triangular microphone array and by detecting human positions with an omnidirectional video camera. A validity criterion, called the audio- or visual-localization counter, validates both features. A reliability criterion, called the evaluator of directional speech arrival, acts as a weight for dynamic AV feature fusion. The results of talker localization experiments in an actual office room confirmed that the proposed AV localizer based on dynamic feature fusion is superior to a conventional localizer that utilizes either audio or visual features alone.
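The reliability-weighted fusion step can be pictured as combining per-direction scores from the microphone array and the camera with weights that are renormalized on the fly, so no prior statistics are needed. The sketch below illustrates only that weighting; the validity check and the localizers themselves are outside its scope, and all names are hypothetical.

```python
import numpy as np

def fuse_directions(p_audio, p_video, r_audio, r_video):
    """Dynamic AV fusion sketch: per-direction scores from the
    microphone array (p_audio) and the omnidirectional camera (p_video)
    are combined with reliability weights normalised to sum to one.
    Only modalities judged valid should be passed in (the validity
    criterion is omitted here). Returns (best direction index, fused)."""
    w = r_audio / (r_audio + r_video)
    fused = w * np.asarray(p_audio) + (1 - w) * np.asarray(p_video)
    return int(fused.argmax()), fused
```

Because the weight is recomputed from the current reliability scores at every frame, a modality that momentarily degrades (for example, the speaker turning away from the camera) is automatically down-weighted.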
Processing Image and Audio Information for Recognising Discourse Participation Status through Features of Face and Voice
Nick Campbell, NiCT
Damien Douxchamps, NAIST
This paper describes a system based on a 360-degree camera with a single microphone that detects speech activity in a round-table context for the purpose of estimating discourse participation status information for each member present. We have obtained 97% accuracy in detecting participants and have shown that the use of non-verbal and backchannel speech information is a useful indicator of participant status in a discourse.