Interspeech 2007 Session WeD.SS: Speech and language technology for less-resourced languages
Wednesday, August 29, 2007
16:00 – 18:00
Astrid Scala 1
Briony Williams (University of Wales, Bangor), Mikel Forcada (Universitat d'Alacant)
A Morpho-graphemic Approach for the Recognition of Spontaneous Speech in Agglutinative Languages – like Hungarian
Péter Mihajlik, Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Hungary
Tibor Fegyó, AITIA International, Budapest, Hungary
Zoltán Tüske, Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Hungary
Pavel Ircing, University of West Bohemia, Plzen, Czech Republic
A coupled acoustic- and language-modeling approach is presented for the recognition of spontaneous speech, primarily in agglutinative languages. The effectiveness of the approach in large-vocabulary spontaneous speech recognition is demonstrated on the Hungarian MALACH corpus. The derivation of morphs from word forms is based on a statistical morphological segmentation tool, while the mapping of morphs onto graphemes is obtained trivially by splitting each morph into individual letters. Using morphs instead of words in language modeling gives significant WER reductions for both phoneme- and grapheme-based acoustic modeling, and the improvements are larger after speaker adaptation of the acoustic models. In conclusion, the morpho-phonemic and the proposed morpho-graphemic ASR approaches yield the same best WERs, significantly lower than the word-based baselines, but the latter requires essentially no language-dependent rules or pronunciation dictionaries.
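The morph-to-grapheme step described in the abstract is simple enough to sketch directly. In the toy example below the Hungarian segmentation is hand-written purely for illustration (the paper derives morphs with a statistical segmenter); only the trivial letter-splitting step mirrors the described mapping:

```python
def morphs_to_graphemes(morphs):
    """Map each morph to a sequence of grapheme units by splitting it
    into individual letters -- no pronunciation dictionary needed."""
    return [list(m) for m in morphs]

# A Hungarian word form segmented into morphs (hand-written, illustrative):
word = "házakban"              # "in houses"
morphs = ["ház", "ak", "ban"]  # stem + plural + inessive case

print(morphs_to_graphemes(morphs))
# [['h', 'á', 'z'], ['a', 'k'], ['b', 'a', 'n']]
```

The grapheme units then serve as acoustic-model targets directly, which is what makes the approach essentially language-independent.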
A Semi-Supervised Learning Approach for Morpheme Segmentation for an Arabic Dialect
Mei Yang, University of Washington
Jing Zheng, SRI International
Andreas Kathol, SRI International
We present a semi-supervised learning approach which utilizes a heuristic model for learning morpheme segmentation for Arabic dialects. We evaluate our approach by applying morpheme segmentation to the training data of a statistical machine translation (SMT) system. Experiments show that our approach is less sensitive to the availability of annotated stems than a previous rule-based approach and learns 12% more segmentations on our Iraqi Arabic data. When applied in an SMT system, our approach yields an 8% relative reduction in the training vocabulary size and a 0.8% relative reduction in the out-of-vocabulary (OOV) rate on the test set, again as compared to the rule-based approach. Finally, our approach also results in a modest increase in BLEU scores.
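The vocabulary and OOV effects described above are easy to illustrate. The sketch below uses hand-written, hypothetical splits as a stand-in for the learned segmenter, showing how segmenting words into morphemes shrinks the training vocabulary and lowers the OOV rate:

```python
def oov_rate(train_vocab, test_tokens):
    """Fraction of test tokens absent from the training vocabulary."""
    return sum(1 for t in test_tokens if t not in train_vocab) / len(test_tokens)

def segment(tokens, splits):
    """Replace each word by its morpheme sequence where a split is known."""
    out = []
    for t in tokens:
        out.extend(splits.get(t, [t]))
    return out

# Hypothetical segmentations (the paper's are learned semi-supervised):
splits = {"walked": ["walk", "+ed"], "walking": ["walk", "+ing"],
          "talked": ["talk", "+ed"]}

train = ["walked", "talked", "walked"]
test = ["walking", "talked"]

print(oov_rate(set(train), test))               # 0.5: "walking" is unseen
print(oov_rate(set(segment(train, splits)),
               segment(test, splits)))          # 0.25: only "+ing" is unseen
```

The unsegmented model has never seen "walking", but the segmented model knows the shared stem "walk", so only the suffix remains out of vocabulary.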
Accelerating the Annotation of Lexical Data for Less-Resourced Languages
Gerhard Beukes van Huyssteen, Centre for Text Technology (CTexT), North-West University
Martin Johannes Puttkammer, Centre for Text Technology (CTexT), North-West University
The development of digital resources is an expensive and time-consuming endeavor, especially in the case of less-resourced languages. In this paper, we describe a freely available, open-source system, called TurboAnnotate, for bootstrapping linguistic data for machine-learning purposes, or for manually creating gold standards or other annotated lists. A detailed description of the design and functionalities of the tool is given, focusing on how the requirements of end-users are addressed through it. We indicate that TurboAnnotate not only promises to increase the accuracy of human annotators, but also to save considerable human effort in terms of annotation time.
On Web-based Speech Resource Creation for Less-Resourced Languages
Christoph Draxler, Institut für Phonetik und Sprachverarbeitung
Web-based creation of speech resources is a new paradigm for establishing spoken language resources. It is particularly suited for less-resourced languages, i.e. languages for which no readily available speech resources exist. This paper maps the speech resource creation tasks to the client-server architecture of the WWW. It presents two tools that have been developed for web-based speech resource creation, and it demonstrates the effectiveness of this approach by three use cases: 1) recording new speaker populations in geographically distributed locations, 2) recordings in adverse recording environments, e.g. hospitals, and 3) field recordings of endangered languages. The only infrastructure requirements are electricity for the equipment and an Internet connection.
Building an Information Retrieval System for Serbian - Challenges and Solutions
Miroslav Martinovic, Department of Computer Science, College of New Jersey, U.S.A.
Srdan Vesic, Faculty of Mathematics, University of Belgrade, Serbia
Goran Rakic, Faculty of Mathematics, University of Belgrade, Serbia
We describe the challenges encountered while building an information retrieval system for the Serbian language, and depict the approaches designed and adopted to handle them. As the backbone of our system, we used the SMART retrieval system, which we augmented with features necessary to deal with the specific properties of the Serbian alphabet. In addition, the morphological richness of the language accentuated the importance of the text preprocessing phase; during this phase, we devised two algorithms which increased retrieval precision by 14% and 27%, respectively. Testing was conducted on the two-gigabyte EBART collection of Serbian newspaper articles.
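The abstract does not detail its alphabet-handling algorithms, but one normalization any Serbian IR system must confront, given the language's two parallel scripts, is script unification. Below is a hedged sketch, not the paper's actual method, that transliterates Serbian Cyrillic to the Latin alphabet before indexing, so queries match documents regardless of script:

```python
# Serbian Cyrillic-to-Latin transliteration table (one-to-one or
# one-to-digraph; lowercase only for simplicity of the sketch).
CYR2LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

def to_latin(text):
    """Lowercase and transliterate Cyrillic characters; pass others through."""
    return "".join(CYR2LAT.get(ch, ch) for ch in text.lower())

print(to_latin("Београд"))  # beograd
```

After this normalization both index terms and query terms live in a single script, so the retrieval engine never has to compare across alphabets.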
Bootstrapping Morphological Analysis of Gĩkũyũ Using Unsupervised Maximum Entropy Learning
Guy De Pauw, University of Antwerp
Peter Waiganjo Wagacha, University of Nairobi
This paper describes a proof-of-principle experiment in which maximum entropy learning is used for the automatic induction of shallow morphological features for the resource-scarce Bantu language Gĩkũyũ. This novel approach circumvents the limitations of typical unsupervised morphological induction methods that employ minimum-edit-distance metrics to establish morphological similarity between words. The experimental results show that the unsupervised maximum entropy learning approach compares favorably to the established AutoMorphology method.
The VoiceTRAN Machine Translation System
Jerneja Zganec Gros, Alpineon d.o.o.
Stanislav Gruden, Alpineon d.o.o.
Freely available tools and language resources were used to build the VoiceTRAN statistical machine translation (SMT) system. Several configurations of the system are presented and evaluated. The VoiceTRAN SMT system outperformed the baseline conventional rule-based MT system in both English-Slovenian in-domain test setups. To further increase the generalization capability of the translation model for lower-coverage out-of-domain test sentences, an “MSD-recombination” approach was proposed. This approach not only allows a better exploitation of conventional translation models, but also performs well in the more demanding translation direction; that is, into a highly inflectional language. Using this approach in the out-of-domain setup of the English-Slovenian JRC-ACQUIS task, we have achieved significant improvements in translation quality.
MuLAS: A Framework For Automatically Building Multi-Tier Corpora
Sergio Paulo, INESC-ID/IST
Luis C. Oliveira, INESC-ID/IST
The Multi-Level Alignment System (MuLAS) is the L2F tool for building multi-tier speech corpora with little or no human intervention. MuLAS automatically combines information from external speech annotations, human- or machine-generated, with the text-based utterance descriptions it creates, in order to build more reliable and complete descriptions of the spoken utterances. This paper presents our methods for multi-tier annotation synchronization, which underlie the operation of MuLAS. These methods have allowed us to extend the building of multi-tier corpora to new languages with little additional effort. MuLAS has been successfully applied to building multi-tier corpora for speech synthesis in American and British English, European Portuguese, and German. Natural prosody generation has also benefited from MuLAS, since prosodic models can be derived from the corpora it builds.
Creating multimedia dictionaries of endangered languages using LEXUS
Jacquelijn Ringersma, Max Planck Institute for Psycholinguistics, Nijmegen
Marc Kemps-Snijders, Max Planck Institute for Psycholinguistics, Nijmegen
This paper reports on the development of a flexible web-based lexicon tool, LEXUS, targeted at linguists involved in the documentation of endangered languages. It allows the creation of lexica within the structure of the proposed ISO LMF standard, using the proposed concept-naming conventions from the ISO data categories, thus enabling interoperability, search, and merging. LEXUS also makes it possible to visualize language, since it provides functionality for including audio, video, and still images in the lexicon. With LEXUS it is possible to create semantic network knowledge bases using typed relations. The LEXUS tool is free to use.
IceNLP: A Natural Language Processing Toolkit for Icelandic
Hrafn Loftsson, Reykjavik University
Eiríkur Rögnvaldsson, University of Iceland
Icelandic is a morphologically complex language, for which language technology resources are scarce. Only a few years ago, it could be stated that language technology was practically non-existent in Iceland. In this paper, we describe the development of an NLP toolkit for processing the language, the challenges faced and the decisions made during development. The current version of the toolkit consists of a tokeniser/sentence segmentiser, a morphological analyser, a linguistic rule-based tagger, and a finite-state parser. The development of our toolkit is a step towards building a Basic Language Resource Toolkit (BLARK) for the Icelandic language.
Phonotactic spoken language identification with limited training data
Marius Peche, Meraka Institute
Marelie Davel, Meraka Institute
Etienne Barnard, Meraka Institute
We investigate the addition of a new language, for which limited resources are available, to a phonotactic language identification system. Two classes of approaches are studied: in the first class, only existing phonetic recognizers are employed, whereas an additional phonetic recognizer in the new language is created for the second class. It is found that the number of acoustic recognizers employed plays a crucial role in determining the recognition accuracy for the new language. We study different approaches to incorporating a language for which audio-only data is available (no pronunciation dictionaries or transcriptions) and find that if more than about 2,000 training utterances are available, a bootstrapped acoustic model for the new language can improve accuracy substantially.
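The phonotactic approach itself can be sketched compactly. The toy PRLM-style illustration below is not the authors' system, and its "phone strings" and training data are invented: a single phone recognizer's output is scored by per-language phone-bigram models with add-alpha smoothing, and the highest-scoring language wins:

```python
import math
from collections import Counter

def train_bigram(phone_strings, alpha=0.5):
    """Return a log-probability scorer from smoothed phone-bigram counts."""
    counts, context, vocab = Counter(), Counter(), set()
    for s in phone_strings:
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
            context[a] += 1
            vocab.update((a, b))
    V = len(vocab)
    def logprob(s):
        return sum(math.log((counts[(a, b)] + alpha) / (context[a] + alpha * V))
                   for a, b in zip(s, s[1:]))
    return logprob

# Toy "phone string" training data per language (invented for illustration):
models = {
    "zu": train_bigram(["bantu", "ubuntu", "umuntu"]),
    "en": train_bigram(["strength", "string", "spring"]),
}

def identify(phones):
    """Pick the language whose phonotactic model scores the input highest."""
    return max(models, key=lambda lang: models[lang](phones))

print(identify("buntu"))  # zu
```

The real system's design question, how many such recognizer/model pairs to run and whether to bootstrap one for the new language, sits on top of exactly this scoring scheme.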
Automatic Speech Recognition for an Under-Resourced Language - Amharic
Solomon Teferra Abate, University of Hamburg
Wolfgang Menzel, University of Hamburg
In this paper we present the development of an automatic speech recognition system (ASRS) for Amharic using the limited available resources and the freely available HTK speech toolkit. Phonological, dialectal, orthographic, and morphological features of Amharic challenge the development of ASRSs, and the scarcity of resources further hinders research and development in Amharic ASR. Addressing these language- and resource-related problems, we have developed syllable- and triphone-based ASR systems for Amharic, achieving 90.43% and 91.31% word recognition accuracy, respectively.
Information Retrieval Strategies for Accessing African Audio Corpora
Abdillahi Nimaan, LIA
Pascal Nocera, LIA
Frédéric Bechet, LIA
Jean-François Bonastre, LIA
In this paper we present a first approach to accessing African oral corpora, combining automatic speech recognition and information retrieval. First, we present the main characteristics of our Somali speech recognizer and the results obtained on real audio archives gathered from Djibouti Radio. Second, we propose a hybrid language model (HLM) including words and sub-words to improve robustness against OOV words. We then conduct information retrieval experiments with various strategies, searching the different outputs of the ASR system (words, sub-words, and hybrid). We finally present a new strategy combining sub-words and words to enhance the information retrieval results.
Morfessor and VariKN machine learning tools for speech and language technology
Vesa Siivola, Helsinki University of Technology
Mathias Creutz, Helsinki University of Technology
Mikko Kurimo, Helsinki University of Technology
This paper introduces two recent open source software packages developed for unsupervised natural language modeling. The Morfessor program segments words automatically into morpheme-like units without any rule-based morphological analyzers. The VariKN toolkit trains language models producing a compact set of high-order n-grams utilizing state-of-the-art Kneser-Ney smoothing. As an example, this paper shows how to construct a language model for speech recognition in multiple languages utilizing only a minimal amount of linguistic resources. Morfessor and VariKN also have other applications in text understanding, information retrieval and machine translation. Unsupervised machine learning techniques are particularly well suited for the development of systems for less-resourced languages, because they do not depend on manually designed morphological or syntactic analyzers or annotated data.
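Morfessor's actual learning minimizes an MDL-style cost over a morph lexicon; the sketch below skips the learning entirely and only illustrates how such a learned inventory is then applied: unseen word forms are segmented into known units, so the downstream n-gram model's vocabulary stays small and effectively closed. The morph set here is a hand-picked, Finnish-flavored toy, not Morfessor output:

```python
# Toy morph inventory standing in for one learned by Morfessor:
MORPHS = {"talo", "ssa", "auto", "lla", "kin"}

def segment(word, morphs=MORPHS):
    """Greedy longest-match segmentation into known morphs;
    falls back to single letters for unknown material."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in morphs:
                out.append(word[i:j])
                i = j
                break
        else:               # no known morph starts here
            out.append(word[i])
            i += 1
    return out

print(segment("talossa"))     # ['talo', 'ssa']
print(segment("autollakin"))  # ['auto', 'lla', 'kin']
```

A language model trained over such units (e.g. with VariKN's Kneser-Ney n-grams) can then score arbitrary new word forms as morph sequences instead of treating them as OOV words.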
Towards Better Language Modeling for Thai LVCSR
Markpong Jongtaveesataporn, Tokyo Institute of Technology
Issara Thienlikit, Tokyo Institute of Technology
Chai Wutiwiwatchai, National Electronics and Computer Technology Center, Thailand
Sadaoki Furui, Tokyo Institute of Technology
One of the difficulties of Thai language modeling is the process of text corpus preparation. Because there is no explicit word boundary marker in written Thai text, word segmentation must be performed prior to training a language model. This paper presents two approaches to language model construction for Thai LVCSR based on pseudo-morpheme merging. The first approach merges pseudo-morphemes using forward and reverse bi-grams. The second approach utilizes the C4.5 decision tree to merge pseudo-morphemes based on multiple features. The performance of ASR systems with language models built using these methods is better than that of systems which use only pseudo-morpheme or lexicon-based word segmentation. These approaches produce results comparable to those obtained by a system using manual segmentation.
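The merging idea can be sketched in a few lines. This toy version substitutes a single co-occurrence threshold for the paper's combined forward and reverse bigram statistics: adjacent pseudo-morpheme pairs that co-occur often enough in the corpus are merged into single lexicon units:

```python
from collections import Counter

def merge_frequent_pairs(sentences, threshold=2):
    """Merge adjacent token pairs whose corpus bigram count meets the
    threshold (a simplified stand-in for forward/reverse bigram criteria)."""
    pairs = Counter()
    for sent in sentences:
        pairs.update(zip(sent, sent[1:]))
    merges = {p for p, c in pairs.items() if c >= threshold}
    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in merges:
                out.append(sent[i] + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged

# Invented pseudo-morpheme sequences (romanized for readability):
corpus = [["na", "ruk"], ["na", "ruk"], ["na", "thi"]]
print(merge_frequent_pairs(corpus))
# [['naruk'], ['naruk'], ['na', 'thi']]
```

Merging frequent pairs into longer units lengthens the effective n-gram span of the language model, which is the motivation behind both merging approaches in the paper.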
ELAN: a Free and Open Source Multimedia Annotation Tool
Han Sloetjes, Max Planck Institute for Psycholinguistics, Nijmegen
Albert Russel, Max Planck Institute for Psycholinguistics, Nijmegen
Alexander Klassmann, Max Planck Institute for Psycholinguistics, Nijmegen
In this demo we will show the main features and capabilities of ELAN, a multipurpose multimedia annotation tool, available for multiple platforms, that is being developed at the Max Planck Institute for Psycholinguistics.
SpeechIndexer in Action: Managing Endangered Formosan Languages
Jozsef Szakos, Department of Indigenous Languages and Communication, National Dong Hua University, Hualian, Taiwan
Ulrike Glavitsch, Department of Computer Science, ETH Zurich, Switzerland
Among the most endangered languages of the world, the Formosan Austronesian vernaculars occupy a sadly prominent place. Two decades of language documentation could only weakly have repaired the broken transmission chain between generations of speakers without the chance to develop SpeechIndexer, a novel software tool combining portability with adaptability for processing large amounts of speech data and its first transcription. This presentation introduces the situation of the languages under investigation and shows examples of SpeechIndexer used for marking up and retrieving spoken corpus data archived for preservation.
A Portable Record Player for Wax Cylinders using a Laser-beam Reflection Method
Tohru Ifukube, University of Tokyo
Yasuyuki Shimizu, Japan Women's University
The wax phonograph cylinder, invented by Thomas Edison in 1885, was the medium for recording sound until about 1930. Around 1900, using an Edison-type phonograph, the Polish anthropologist B. Pilsudski recorded the songs of the Ainu people in northernmost Japan on 65 wax cylinders. The cylinders were accidentally rediscovered in Poland, and in 1984 we were asked to reproduce them. Most of them, however, had degraded through re-crystallization and had many cracks on their surfaces. Pilsudski's wax cylinders were nevertheless successfully reproduced using a laser-beam reflection method as well as a light stylus that we developed.