Speech Recognition Techniques for a Sign Language Recognition System
Philippe Dreuw, RWTH Aachen University
David Rybach, RWTH Aachen University
Thomas Deselaers, RWTH Aachen University
Morteza Zahedi, RWTH Aachen University
Hermann Ney, RWTH Aachen University
One of the most significant differences between automatic sign language recognition (ASLR) and automatic speech recognition (ASR) lies in the computer vision problems involved, whereas the corresponding problems in speech signal processing have largely been solved by intensive research over the last 30 years. We present our approach, in which we start from a large-vocabulary speech recognition system in order to profit from the insights obtained in ASR research. The developed system is able to recognize sentences of continuous sign language independently of the speaker. The features used are obtained from standard video cameras without any special data acquisition devices. In particular, we focus on feature and model combination techniques successfully applied in ASR, and on the use of pronunciation and language models (LM) in sign language. These techniques can be used for all kinds of sign language recognition systems and for many video analysis problems in which temporal context is important, e.g. action or gesture recognition. On a publicly available benchmark database consisting of 201 sentences and 3 signers, we achieve a 17% word error rate.
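The paper reports its result as a word error rate; as a minimal illustration of how such a figure is computed (Levenshtein alignment of the recognized gloss sequence against the reference transcription, the standard ASR metric), a sketch with invented gloss data might look as follows:

```python
# Hypothetical WER computation: edit-distance alignment of recognized
# sign glosses against the reference transcription (standard ASR metric).
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal substitutions/insertions/deletions needed to turn
    # the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative glosses, not taken from the benchmark database.
print(word_error_rate("JOHN BUY HOUSE", "JOHN BUY CAR"))  # 0.333...
```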
FrB.P1a‑2
Impact of Various Small Sound Source Signals on Voice Conversion Accuracy in Speech Communication Aid for Laryngectomees
Keigo Nakamura, Nara Institute of Science and Technology, Japan
Tomoki Toda, Nara Institute of Science and Technology, Japan
Hiroshi Saruwatari, Nara Institute of Science and Technology, Japan
Kiyohiro Shikano, Nara Institute of Science and Technology, Japan
We have proposed a speaking aid system using statistical voice conversion for laryngectomees whose vocal folds have been removed. This paper investigates the influence of various small sound sources on voice conversion accuracy, to reveal which kinds of signals are acceptable in our system. The spectral envelopes and power of the sound sources are controlled independently. In total, eight kinds of sound source signals are used to investigate differences in voice conversion accuracy. Results of objective and subjective evaluations demonstrate that voice conversion accepts sound sources with a wide range of spectral envelopes and power levels, as long as their power is not comparable to that of the silent parts.
FrB.P1a‑3
Design and Development of Voice Controlled Aids for Motor-Handicapped Persons
Petr Cerva, Institute of Information Technology and Electronics, Technical University of Liberec, Hálkova 6, 461 17 Liberec, Czech Republic
Jan Nouza, Institute of Information Technology and Electronics, Technical University of Liberec, Hálkova 6, 461 17 Liberec, Czech Republic
In this paper we present two voice-operated systems that have been designed to give Czech motor-handicapped people full access to computers and computer-based services. The programs, named MyVoice and MyDictate, are complementary in their functions. Both employ ASR engines developed in our lab. The former is used primarily as a mid-size-vocabulary (up to 10K words) voice commander for PC programs and PC-controlled home devices; the latter allows for very-large-vocabulary dictation (with more than 500K words). They are designed to cooperate and thus allow entirely hands-free access to any computer application, including text typing, e-mail exchange, Internet browsing and handling telephone calls, as well as control of external home devices such as TV/radio sets or air-conditioning.
FrB.P1a‑4
Management of Static/Dynamic Properties in a Multimodal Interaction System
Kouichi Katsurada, Toyohashi University of Technology
Yuji Okuma, OBIC Co., LTD.
Makoto Yano, Toyohashi University of Technology
Yurie Iribe, Toyohashi University of Technology
Tsuneo Nitta, Toyohashi University of Technology
This paper presents a mechanism for handling static/dynamic properties in a web-based Multi-Modal Interaction (MMI) system. The static/dynamic properties considered here include the user profile, the user's facial expression, the surrounding environment, and so on. By using these properties, the MMI system can make the interaction more natural, based on the context or situation. To consolidate these properties into a single module, we have designed and developed a static/dynamic property manager for our MMI system. We have also prototyped a user navigation system that uses the user profile, the user's facial expression and GPS information as static/dynamic properties.
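The abstract does not describe the manager's interface; purely as an illustrative sketch of consolidating static and dynamic properties behind a single module (all names and values below are hypothetical, not the authors' design):

```python
from typing import Any, Callable, Dict

# Hypothetical static/dynamic property manager: static properties are stored
# values (e.g. a user profile), dynamic properties are providers queried on
# demand (e.g. facial-expression analysis or GPS).  Not the authors' design.
class PropertyManager:
    def __init__(self):
        self._static: Dict[str, Any] = {}
        self._dynamic: Dict[str, Callable[[], Any]] = {}

    def set_static(self, name: str, value: Any) -> None:
        self._static[name] = value

    def register_dynamic(self, name: str, provider: Callable[[], Any]) -> None:
        self._dynamic[name] = provider

    def get(self, name: str) -> Any:
        if name in self._dynamic:   # dynamic values are re-read on every access
            return self._dynamic[name]()
        return self._static.get(name)

pm = PropertyManager()
pm.set_static("user_profile", {"name": "Alice", "language": "ja"})
pm.register_dynamic("gps_position", lambda: (34.76, 137.39))  # placeholder reading
print(pm.get("user_profile"), pm.get("gps_position"))
```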
FrB.P1a‑5
Evaluation of Alternatives on Speech to Sign Language Translation
Rubén San-Segundo, Grupo de Tecnología del Habla, Universidad Politécnica de Madrid, Spain
Alicia Pérez, Dpto de Electricidad y Electrónica, Facultad de Ciencia y Tecnología, Universidad del País Vasco, Spain
Daniel Ortiz, Instituto Tecnológico de Informática, Universidad Politécnica de Valencia, Spain
Luis Fernando D'Haro, Grupo de Tecnología del Habla, Universidad Politécnica de Madrid, Spain
M. Inés Torres, Dpto de Electricidad y Electrónica, Facultad de Ciencia y Tecnología, Universidad del País Vasco, Spain
Francisco Casacuberta, Instituto Tecnológico de Informática, Universidad Politécnica de Valencia, Spain
This paper evaluates different approaches to speech-to-sign-language machine translation. The application framework focuses on assisting deaf people in applying for a passport or obtaining related information. In this context, the main aim is to automatically translate the spontaneous speech uttered by an officer into Spanish Sign Language (SSL). In order to obtain the best translation quality, three alternative techniques have been evaluated: a rule-based approach, a phrase-based statistical approach, and a connectionist approach that makes use of stochastic finite state transducers. The best speech translation experiments yielded a 32.0% SER (Sign Error Rate) and a BLEU (BiLingual Evaluation Understudy) score of 7.1, including speech recognition errors.
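For readers unfamiliar with the reported metrics, the following sketch computes a simplified, smoothed sentence-level BLEU score over sign glosses; the gloss sequences are invented, and the Sign Error Rate would be computed analogously to an ordinary word error rate:

```python
import math
from collections import Counter

# Simplified sentence-level BLEU over sign glosses (uniform weights, add-one
# smoothing on the n-gram precisions); illustrative only, not the paper's setup.
def bleu(reference, hypothesis, max_n=4):
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = sum(hyp_ngrams.values())
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return brevity * math.exp(log_prec)

# Invented gloss sequences, not from the passport-office corpus.
print(bleu("PASSPORT YOU WANT RENEW", "PASSPORT YOU RENEW"))
```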
FrB.P1a‑6
Speech based Drug Information System for Aged and Visually Impaired Persons
Géza Németh, Department of Telecommunications and Media Informatics, BUTE, Budapest, Hungary
Gábor Olaszy, Department of Telecommunications and Media Informatics, BUTE, Budapest, Hungary
Mátyás Bartalis, Department of Telecommunications and Media Informatics, BUTE, Budapest, Hungary
Géza Kiss, Department of Telecommunications and Media Informatics, BUTE, Budapest, Hungary
Csaba Zainkó, Department of Telecommunications and Media Informatics, BUTE, Budapest, Hungary
Péter Mihajlik, Department of Telecommunications and Media Informatics, BUTE, Budapest, Hungary
Medicine Line (MLN) is an automatic telephone information system that has been operating in Hungary since December 2006. It is intended primarily for visually handicapped persons and elderly people. In Hungary, the National Institute of Pharmacy (NIP) coordinates the approval of new drugs and of their Patient Information Leaflets (PIL). Medicine Line reads this textual information to the citizens, chapter by chapter. About 5000 different medicaments are used in Hungary; new drugs come into use regularly, and some are withdrawn after a certain time. The MLN system ensures 24-hour access to the information. The spoken dialogue input is processed by a specialized ASR module (the caller says the name of the drug, the chapter title, etc.). The output is produced by a TTS synthesizer specialized to read drug names and medical Latin words correctly. The user can also control the system with DTMF buttons. In this article we focus on the features of the speech-based components.
FrB.P1a‑7
Automatic Speech Recognition with a Cochlear Implant Front-End
Waldo Nogueira, Information Technology Laboratory, Leibniz Universität Hannover, Germany
Tamás Harczos, Fraunhofer Institute for Digital Media Technology, Ilmenau, Germany
Bernd Edler, Information Technology Laboratory, Leibniz Universität Hannover, Germany
Joern Ostermann, Institut für Informationsverarbeitung, Leibniz Universität Hannover, Germany
Andreas Büchner, Hannover Hörzentrum, Medizinische Hochschule Hannover, Germany
Today, cochlear implants (CIs) are the treatment of choice for patients with profound hearing loss. However, speech intelligibility with these devices is still limited. One factor that determines hearing performance is the processing method used in the CI; therefore, research has focused on designing different speech processing methods. The evaluation of these strategies is subject to variability, as it is usually performed with cochlear implant recipients. Hence, an objective evaluation method would provide more robustness than tests performed with CI patients. This paper proposes a method for evaluating signal processing strategies for CIs based on a hidden Markov model speech recognizer. Two signal processing strategies for CIs, the Advanced Combination Encoder (ACE) and the Psychoacoustic Advanced Combination Encoder (PACE), have been compared in a phoneme recognition task. The results show that PACE obtained higher recognition scores than ACE, as has also been found with CI recipients.
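As a rough illustration of this kind of objective evaluation (one Gaussian HMM per phoneme, trained on features produced by a given CI front-end, with front-ends compared by recognition accuracy), a sketch assuming the hmmlearn package and pre-extracted feature sequences might read as follows; it is not the authors' recognizer:

```python
import numpy as np
from hmmlearn import hmm  # assumes the hmmlearn package is installed

# Toy HMM-based phoneme classifier: one Gaussian HMM per phoneme, trained on
# feature sequences produced by a CI front-end (e.g. ACE or PACE envelopes).
def train_phoneme_models(train_data, n_states=3):
    """train_data: dict phoneme -> list of (frames x dims) NumPy arrays."""
    models = {}
    for phoneme, sequences in train_data.items():
        X = np.vstack(sequences)
        lengths = [len(s) for s in sequences]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[phoneme] = m
    return models

def classify(models, sequence):
    # Pick the phoneme whose HMM assigns the highest log-likelihood
    # to the (frames x dims) test sequence.
    return max(models, key=lambda p: models[p].score(sequence))

# The accuracy of one front-end is the fraction of test sequences classified
# correctly; comparing this figure for ACE- and PACE-derived features gives
# an objective score in place of listening tests with CI recipients.
```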
FrB.P1a‑8
Voice Activated Powered Wheelchair with Non-Voice Rejection Algorithm
Soo-Young Suk, National Institute of Advanced Industrial Science and Technology, Japan
Hiroaki Kojima, National Institute of Advanced Industrial Science and Technology, Japan
In this paper, we introduce a non-voice rejection method that performs Voice/Non-Voice (V/NV) classification using a fundamental frequency (F0) estimator called YIN. Although current speech recognition technology has achieved high performance, it is insufficient for some applications in which high reliability is required, such as voice control of powered wheelchairs for disabled persons. The non-voice rejection algorithm, which classifies V/NV in the Voice Activity Detection (VAD) step, helps to realize a highly reliable system. The proposed algorithm uses the ratio of the reliable F0 contour to the whole input interval. To evaluate the performance of the proposed method, we used 1567 voice commands and 447 noise segments recorded during powered wheelchair control in a real environment. The results indicate a recall rate of 97% when the lowest threshold is selected for noise classification, with 99% precision in VAD.
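The abstract does not specify the reliability criterion applied to the F0 contour; the sketch below only illustrates the general idea of thresholding the share of voiced frames, substituting librosa's probabilistic YIN (pyin) for YIN because it returns a per-frame voicing decision directly, with invented parameter values:

```python
import numpy as np
import librosa  # provides YIN / probabilistic-YIN F0 estimators

# Hedged sketch of V/NV classification by the ratio of reliably voiced frames
# to the whole input interval.  The paper uses YIN; probabilistic YIN (pyin)
# is substituted here, and the threshold is invented for illustration.
def is_voice(path, fmin=60.0, fmax=400.0, ratio_threshold=0.4):
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    voiced_ratio = np.mean(voiced_flag)       # share of frames judged voiced
    return voiced_ratio >= ratio_threshold    # True -> accept as a voice command

# Example (paths are placeholders):
# print(is_voice("command_forward.wav"), is_voice("motor_noise.wav"))
```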
FrB.P1a‑9
Phonetic-Based Sentence-Level Rewriting of Questions Typed by Dyslexic Spellers in an Information Retrieval Context
Laurianne Sitbon, LIA - University of Avignon
Patrice Bellot, LIA - University of Avignon
Philippe Blache, LPL - University of Provence
This paper introduces a method that combines spell checking and phonetic interpretation in order to automatically rewrite questions typed by dyslexic spellers. The method uses a finite-state automata framework. Dysorthography involves incorrect word segmentation, which usually causes classical spelling correctors to fail. The specificities of the information retrieval context are that inflection errors have no impact, since the sentences are lemmatised and filtered, and that several hypotheses can be processed for one query. Our system is evaluated on questions collected with the help of a speech therapist. The word error rate on lemmatised sentences falls from 60% to 22% (and to 0% for 43% of the sentences).
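The authors' finite-state framework cannot be reconstructed from the abstract alone; the toy sketch below merely illustrates the underlying idea of matching misspelled input to lexicon entries through a phonetic key rather than through surface spelling, using an invented and greatly simplified key function:

```python
import re

# Toy phonetic rewriting: index a lexicon by a crude phonetic key and propose
# lexicon words whose key matches that of the (possibly misspelled) input.
# The key function is invented for illustration and is far simpler than the
# phonetic interpretation / finite-state machinery described in the paper.
def phonetic_key(word):
    w = word.lower()
    w = re.sub(r"qu|k|c(?=[aou])", "k", w)   # collapse /k/ spellings
    w = re.sub(r"ph", "f", w)                # ph -> f
    w = re.sub(r"(.)\1+", r"\1", w)          # squeeze doubled letters
    w = re.sub(r"[aeiouy]+", "a", w)         # neutralise vowel spellings
    return w

def candidates(token, lexicon):
    key = phonetic_key(token)
    return [w for w in lexicon if phonetic_key(w) == key]

lexicon = ["pharmacie", "farine", "farm", "philosophie"]
print(candidates("farmacie", lexicon))   # ['pharmacie'] under this toy key
```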