Context 

The development of socially interactive robots is a motivating challenge, and a considerable number of mature robotic systems have been developed during the last decade. Moving such robots out of laboratories and into private homes, where they become robot companions, is a deeper challenge because communication between a human and an assistant robot must be as natural as possible. To perform tasks based on multimodal human/robot (H/R) interaction, we focus our work on speech and gesture recognition, interpretation, and fusion. This work is done in collaboration with LAAS-CNRS, and the modules we are developing in this context are integrated on the LAAS robotic platform called Jido.

Overview

Relevant information is provided by the content of the message uttered by the user, by her/his gestures, and by her/his reactive body motions. This information needs to be extracted, processed and merged by the robotic platform. On the one hand, fusion makes it possible to specify missing parameters related to person/object IDs or to locations when references are used in verbal statements: "donne lui un verre" ("Give him a glass"), "donne-moi cette boîte" ("Give me this box"), "Pose la ici" ("Put it down here"). On the other hand, fusion can also improve the robustness of the system in noisy environments. All this requires audio and video stream analysis.

 

Speech recognition and interpretation

Natural communication between a person and a robot companion requires recognizing each user utterance and understanding its meaning in relation to the current context. This context may be a specific task, a place, an object, an action, a set of objects or other people involved. In some cases, the utterance must also be related to a complementary gesture, which also needs to be recognized and interpreted.

Speech recognition:

To process continuous French speech, we use a grammar-based speech engine (Julian, a version of the open-source engine Julius developed by the Continuous Speech Recognition Consortium). This engine requires essential linguistic resources: a set of acoustic models for French phonetic units, a phonetic lexicon drawn from the French lexical database BDLEX, and a set of grammars. Each grammar has been specifically designed to describe sentences related to one of the tasks taken into account in our multimodal interaction scenarios. The user can introduce him/herself, "Salut Jido, c'est Paul" ("Hello Jido, it's Paul"), give basic movement orders, "Tourne à droite" ("Turn right"), or make more complex guidance requests, "Emmène-moi au hall d'entrée" ("Take me to the entrance hall"). The user can also ask the robot to handle or move objects. Such sentences can be fully specified, or they can use location, object or person references: "viens ici" ("Come here"), "Prends cette bouteille" ("Take this bottle"), "Donne-moi la bouteille" ("Give me the bottle"), "Donne la moi" ("Give it to me"). In our application context, references are resolved by means of gesture and human position analysis. We are also working on taking more spontaneous speech into account, for example speech that includes hesitations: "Donne-moi .. euh .. cette bouteille" ("Give me .. uh .. this bottle"). The speech recognition output is then processed to be interpreted.
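To make the idea of task-specific grammars more concrete, here is a minimal Python sketch. It is an illustration only: it works on text rather than audio, it uses regular expressions instead of the Julian grammar format, and the names `TASK_GRAMMARS`, `normalize` and `parse` are hypothetical. The hesitation filter is a deliberately naive assumption, not the approach used on the robot.

```python
import re

# Hypothetical toy grammars, one list of patterns per task (assumption:
# these do not reproduce the project's real grammars or the Julian format).
TASK_GRAMMARS = {
    "introduction": [r"salut jido c'est (?P<person>\w+)"],
    "motion": [
        r"tourne à (?P<direction>droite|gauche)",
        r"viens (?P<location>ici)",
    ],
    "guidance": [r"emmène-moi au (?P<place>.+)"],
    "handling": [
        r"prends (?P<object>cette bouteille|la bouteille)",
        r"donne-moi (?P<object>cette bouteille|la bouteille)",
    ],
}


def normalize(utterance):
    """Lowercase the text, drop commas and naive hesitation markers."""
    text = utterance.lower().replace(",", " ")
    tokens = [t for t in text.split() if t not in {"euh", ".."}]
    return " ".join(tokens)


def parse(utterance):
    """Return (task, slots) for the first grammar accepting the sentence."""
    text = normalize(utterance)
    for task, patterns in TASK_GRAMMARS.items():
        for pattern in patterns:
            match = re.fullmatch(pattern, text)
            if match:
                return task, match.groupdict()
    return None  # sentence not covered by any task grammar


print(parse("Salut Jido, c'est Paul"))
print(parse("Donne-moi .. euh .. cette bouteille"))
```

As in any grammar-based approach, a sentence that no grammar describes is simply rejected, which is why each grammar has to be designed around the sentences expected for its task.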

Speech interpretation:

The relevant semantic units are extracted from the recognizer output using a semantic lexicon specifically designed for this purpose. Some meaningful words are related to actions, while others are related to objects, to object attributes such as color or size, to locations, or to robot configuration parameters (speed, rotation, distance). The global interpretation process transforms the recognized sentence into a valid command, which is sent to the robot supervisor [1] to be executed. A command is valid if it is compatible with one of the interpretation models belonging to our application domain. When the command parameters are explicitly provided, they are extracted by this semantic analysis step. Otherwise, they are marked as missing and must be specified by a gesture for the user request to be completely understood.
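The interpretation step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the lexicon entries, the interpretation models, the `MISSING` marker and the command layout are all hypothetical, and demonstratives such as "cette" are not handled.

```python
# Hypothetical semantic lexicon: word -> (semantic role, value).
SEMANTIC_LEXICON = {
    "donne": ("action", "give"),
    "prends": ("action", "take"),
    "pose": ("action", "put"),
    "bouteille": ("object", "bottle"),
    "verre": ("object", "glass"),
    "moi": ("recipient", "speaker"),
}

# Hypothetical interpretation models: action -> required parameters.
INTERPRETATION_MODELS = {
    "give": ("object", "recipient"),
    "take": ("object",),
    "put": ("object", "location"),
}

MISSING = "MISSING"  # marker for parameters to be supplied by a gesture


def interpret(recognized):
    """Turn a recognized sentence into a command, or None if invalid."""
    words = recognized.lower().replace("-", " ").split()
    slots = {}
    for word in words:
        if word in SEMANTIC_LEXICON:
            role, value = SEMANTIC_LEXICON[word]
            slots.setdefault(role, value)  # keep the first value per role
    action = slots.get("action")
    required = INTERPRETATION_MODELS.get(action)
    if required is None:
        return None  # no compatible interpretation model
    params = {name: slots.get(name, MISSING) for name in required}
    return {"action": action, "params": params}


print(interpret("Donne-moi la bouteille"))
print(interpret("Pose la ici"))
```

In this sketch, "Donne-moi la bouteille" yields a fully specified command, whereas "Pose la ici" yields a command whose object and location are both marked as missing.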

For more details about the robotics applications, see: RECO module on Rackham and RECO module on Jido.

Visual tracking and gesture recognition

Our system dedicated to the visual perception of the robot's user includes 3D face and two-hand tracking. Based on these tracking results, gesture recognition is performed. This work is still in progress.

For more details about the robotics applications, see: GEST module on Jido.

Audio-visual data fusion

Fusion is tackled from a hierarchical and rule-based point of view. A late fusion process is applied when the speech interpretation result specifies that a gesture interpretation output is required; this output is then integrated into the command that will be sent to the robot supervisor.
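A minimal Python sketch of such a late, rule-based fusion step is given below. It is only an illustration: the command layout, the `MISSING` marker, the gesture dictionary and the function name `fuse` are assumptions made for this example, not the interface of the actual modules.

```python
MISSING = "MISSING"  # hypothetical marker set by the speech interpretation


def fuse(command, gesture=None):
    """Complete a command with a gesture interpretation when required.

    `command` is assumed to look like {"action": ..., "params": {...}};
    `gesture` is assumed to map parameter names to values derived from
    gesture analysis (for example a pointed object or location).
    """
    missing = [k for k, v in command["params"].items() if v == MISSING]
    if not missing:
        return command  # rule 1: speech alone is sufficient
    if gesture is None:
        return None  # rule 2: a gesture is required but none is available
    params = dict(command["params"])
    for name in missing:
        if name not in gesture:
            return None  # rule 3: the gesture does not resolve the reference
        params[name] = gesture[name]
    return {"action": command["action"], "params": params}


speech = {"action": "put", "params": {"object": MISSING, "location": MISSING}}
pointing = {"object": "box_1", "location": (1.0, 2.0, 0.0)}
print(fuse(speech, pointing))
```

The fusion is "late" in the sense that each modality is interpreted separately first, and the gesture result is only consulted when the speech interpretation asks for it.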

Finally, the combination of gestures with deictic and anaphoric utterances has been tested in household robotics operations (see Demo).

Projects

Part of this work is a contribution to the following European (LAAS) projects:

  • Cogniron ("The Cognitive Robot Companion", 2004-2008)
  • CommRob ("Advanced Robot behaviour and high-level multimodal communication", 2007-2009)

 

Contributors

Main publications

Brice Burger, Isabelle Ferrané, Frédéric Lerasle. Multimodal Interaction Abilities for a Robot Companion. In: Int. Conf. on Computer Vision Systems (ICVS2008), pp. 549-558, Santorini (Greece), May 2008.

 

Demo

A movie showing visual tracking, speech understanding and probabilistic audio-visual data fusion:

<object style="width:320px;height:260px"> <param name="movie" value="http://www.irit.fr/~Philippe.Joly/Homepage_files/Demo/mediaplayer.swf?file=humrobcom.flv" /> <param name="quality" value="high" /> </object>

Flash player required

Other movies are available on this site.

Access to the LAAS robots page.