Interspeech 2007 Session ThD.O3: Multimodal interaction: analysis and technology
Type
oral
Date
Thursday, August 30, 2007
Time
16:00 – 18:00
Room
Marble
Chair
Helen Meng (The Chinese University of Hong Kong)
ThD.O3‑1
16:00
Behavior Models for Learning and Receptionist Dialogs
Hartwig Holzapfel, interACT Research, Universität Karlsruhe, Germany
Alex Waibel, interACT Research, Universität Karlsruhe, Germany
We present a dialog model for identifying persons and learning person names, associated persons, and face IDs in a receptionist dialog. The proposed model decomposes the main dialog task into separate dialog behaviors that can be implemented independently, allowing a mixture of handcrafted models and dialog strategies trained with reinforcement learning. The dialog model was implemented on our robot and tested in a number of experiments on a receptionist task. A Wizard-of-Oz experiment is used to evaluate the dialog structure, to inform the definition of metrics, and to collect a data corpus for training a user simulation and a component error model. Using these models, we train a dialog module for learning a person's name with reinforcement learning.
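The training setup described in the abstract pairs a learned policy with a user simulation and an error model. As a rough illustration of that general scheme only (the states, actions, error rate, and rewards below are hypothetical and not taken from the paper), a tabular Q-learning loop for a name-learning subdialog might look like this:

    import random
    from collections import defaultdict

    # Hypothetical dialog states and system actions for a name-learning subdialog.
    STATES = ["no_name", "name_unconfirmed", "name_confirmed"]
    ACTIONS = ["ask_name", "confirm_name", "spell_name", "close"]

    def simulated_user(state, action, error_rate=0.2):
        """Toy user simulation plus component error model:
        returns (next_state, reward, episode_done)."""
        if action == "ask_name" and state == "no_name":
            ok = random.random() > error_rate          # recognition may fail
            return ("name_unconfirmed" if ok else "no_name", -1, False)
        if action in ("confirm_name", "spell_name") and state == "name_unconfirmed":
            ok = random.random() > error_rate
            return ("name_confirmed" if ok else "name_unconfirmed", -1, False)
        if action == "close":
            return (state, 20 if state == "name_confirmed" else -20, True)
        return (state, -2, False)                      # unhelpful action: small penalty

    Q = defaultdict(float)
    alpha, gamma, epsilon = 0.1, 0.95, 0.1

    for episode in range(5000):
        state, done = "no_name", False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, reward, done = simulated_user(state, action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state

    # The learned greedy policy should be: ask for the name, confirm it, then close.
    for s in STATES:
        print(s, max(ACTIONS, key=lambda a: Q[(s, a)]))

The sketch only shows the control flow of training against a simulated user with injected recognition errors; the paper's behaviors, state space, and reward design are considerably richer.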
ThD.O3‑2
16:20
Design of a Rich Multimodal Interface for Mobile Spoken Route Guidance
Markku Turunen, University of Tampere
Jaakko Hakulinen, University of Tampere
Anssi Kainulainen, University of Tampere
Aleksi Melto, University of Tampere
Topi Hurtig, University of Tampere
We present a design of a rich multimodal interface for mobile route guidance. The application serves public transport information in Finland, including support for pedestrian guidance when the user is transferring between means of transportation. The range of input and output modalities includes speech synthesis, speech recognition, a fisheye GUI, haptics, contextual text input, physical browsing, physical gestures, non-speech audio, and global positioning information. Together, these modalities provide an interface that is accessible to a wide range of users, including people with various levels of visual impairment. In this paper we describe the functionality and interface design of the publicly available prototype system.
ThD.O3‑3
16:40
The Virtual Guide: A Direction Giving Embodied Conversational Agent
Mariet Theune, University of Twente
Dennis Hofs, University of Twente
Marco van Kessel, University of Twente
We present the Virtual Guide, an embodied conversational agent that can give directions in a 3D virtual environment. We discuss how dialogue management, language generation and the generation of appropriate gestures are carried out in our system.
ThD.O3‑4
17:00
Creating Spoken Dialogue Characters from Corpora without Annotations
Sudeep Gandhe, University of Southern California ICT
David Traum, University of Southern California ICT
Virtual humans are being used in a number of applications, including simulation-based training, multi-player games, and museum kiosks. Natural language dialogue capabilities are an essential part of their human-like persona. These dialogue systems aim to be believable and generally have to operate within the bounds of their restricted domains. Most dialogue systems operate on the dialogue-act level and require extensive annotation efforts. Semantic annotation and rule authoring have long been known as bottlenecks for developing dialogue systems for new domains. In this paper, we investigate several dialogue models for virtual humans that are trained on an unannotated human-human corpus. These models are inspired by information retrieval and work on the surface text level. We evaluate them in text-based and spoken interactions, as well as against the upper baseline of human-human dialogues.
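Retrieval-style dialogue models of the kind described here select a response by matching the current utterance against surface text in an unannotated corpus. The following is a minimal sketch of that general idea, not the authors' system; the toy corpus and the TF-IDF cosine-similarity matching are illustrative assumptions:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy unannotated human-human corpus: (utterance, response) pairs.
    corpus = [
        ("hello there", "hi, welcome to the museum"),
        ("what can you tell me about this exhibit", "it shows early flight experiments"),
        ("who built it", "it was built by local engineers in 1903"),
        ("thanks, goodbye", "goodbye, enjoy your visit"),
    ]

    prompts = [utterance for utterance, _ in corpus]
    vectorizer = TfidfVectorizer()
    prompt_vectors = vectorizer.fit_transform(prompts)

    def respond(user_utterance: str) -> str:
        """Return the response whose prompt is most similar to the input,
        measured by TF-IDF cosine similarity on the surface text."""
        query = vectorizer.transform([user_utterance])
        scores = cosine_similarity(query, prompt_vectors)[0]
        return corpus[scores.argmax()][1]

    print(respond("tell me about the exhibit"))
    # -> "it shows early flight experiments"

The key point the sketch illustrates is that no dialogue-act annotation is required: response selection operates directly on the surface text of the corpus.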
ThD.O3‑5
17:20
Complementarity and Redundancy in Multimodal User Inputs with Speech and Pen Gestures
Pui Yu Hui, The Chinese University of Hong Kong
Helen Meng, The Chinese University of Hong Kong
Zheng Yu Zhou, The Chinese University of Hong Kong
We present a comparative analysis of multi-modal user inputs with speech and pen gestures, together with their semantically equivalent uni-modal (speech only) counterparts. The multi-modal interactions are derived from a corpus collected with a PDA emulator in the context of navigation around Beijing. We devise a cross-modality integration methodology that interprets a multi-modal input and paraphrases it as a semantically equivalent, uni-modal input. Thus we generate parallel multi-modal (MM) and uni-modal (UM) corpora for comparative study. Empirical analysis based on class trigram perplexities shows two categories of data: (PP_MM < PP_UM) and (PP_MM = PP_UM). The former involves complementarity across modalities in expressing the user's intent. The latter involves redundancy, which can be useful for handling recognition errors by exploiting mutual reinforcement across modalities. We present explanatory examples of data in these two categories.
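For reference, the class trigram perplexity underlying the PP_MM versus PP_UM comparison can be written in its standard form (the notation below is ours, not taken from the paper):

    PP = \left[ \prod_{i=1}^{N} P\big(w_i \mid c_i\big)\, P\big(c_i \mid c_{i-2}, c_{i-1}\big) \right]^{-1/N}, \qquad c_i = c(w_i)

where c(w) denotes the class of word w and N is the corpus length in words. PP_MM and PP_UM are this quantity computed on the multi-modal and uni-modal corpora, respectively; a lower value indicates that the corpus is more predictable under the class trigram model.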
ThD.O3‑6
17:40
Children’s convergence in referring expressions to graphical objects in a speech-enabled computer game
Linda Bell, TeliaSonera
Joakim Gustafson, KTH
This paper describes an empirical study of children’s spontaneous interactions with an animated character in a speech-enabled computer game. More specifically, it deals with convergence of referring expressions. 49 children were invited to play the game, which was initiated by a collaborative “put-that-there” task. In order to solve this task, the children had to refer to both physical objects and icons in a 3D environment. For physical objects, which were mostly referred to using straight-forward noun phrases, lexical convergence took place in 90% of all cases. In the case of the icons, the children were more innovative and spontaneously referred to them in many different ways. Even after being prompted by the system, lexical convergence took place for only 50% of the icons. In the cases where convergence did take place, the effect of the system’s prompts were quite local, and the children quickly resorted to their original way of referring when naming new icons in later tasks