Multimodal Human Robot Interaction

Jpetiot / January 7, 2014 / Applications


Developing socially interactive robots is a motivating challenge, and a considerable number of mature robotic systems have been developed over the last decade. Moving such robots out of laboratories and into private homes, where they become robot companions, is a deeper challenge, because communication between a human and a robotic home assistant must be as natural as possible. To perform tasks based on multimodal human-robot (H/R) interaction, we have focused our work on speech and gesture recognition, interpretation, and fusion. Our first approach was command-oriented, enabling a human user to communicate with the robot using speech and gesture, as shown in Figure 1. Specific modules developed in the scope of Brice Burger’s PhD have been integrated on the LAAS robotic platforms (HRP2, Jido).

Figure 1: Multimodal human-robot interaction based on speech/gesture commands

Since 2006, we have been working on this research topic in collaboration with the RAP team from LAAS-CNRS. To go a step further and enable robots to tackle tasks in a daily and assisted living context, we are focusing our work on the robot’s awareness, that is, its perception of the user and of the user’s home environment. We are currently working on this research topic through the RIDDLE project and Christophe Mollaret’s PhD.


Previous work on multimodal command-based interaction (Brice Burger’s PhD 2006-2010)

The messages uttered by the user, as well as his gestures and body motions, carry relevant cues that a robot has to extract, process and merge in order to interact with him. The role of fusion is two-fold. First, it makes it possible to specify missing parameters related to the person, object, or location when references are used in verbal statements: “donne lui un verre” (“Give him a glass”), “donne-moi cette boîte” (“Give me this box”), “Pose la ici” (“Put it down here”). Second, it can improve system robustness in noisy environments. All this requires analyzing audio and video streams, recognizing speech and gestures, understanding and merging them, and then reacting according to the abilities of the robotic platform.

Speech recognition has been implemented using the grammar-based version of Julius. We used acoustic models that we developed for French, together with lexical resources from BDLEX. Grammars dedicated to the different implemented tasks have been designed to process utterances such as greetings (“Hello I am Paul”), simple motion commands (“Turn left”) or more complex ones (“Take me to the entrance hall”), and object handling or moving commands (“Pick up the red bottle”), some of them including spatial references and deictics (“Put this glass here”).
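To make the idea of task-specific grammars more concrete, here is a minimal sketch in Python. It does not use Julius or its grammar format: the regular expressions, the task names and the slot names (`color`, `object`, `location_ref`, ...) are hypothetical stand-ins, chosen only to cover the example utterances quoted above.

```python
import re

# Hypothetical mini-grammar: plain regular expressions standing in for
# task-specific grammars. This is NOT Julius syntax.
PATTERNS = [
    ("greeting", re.compile(r"^hello i am (?P<name>\w+)$")),
    ("turn", re.compile(r"^turn (?P<direction>left|right)$")),
    ("guide", re.compile(r"^take me to the (?P<location>[\w ]+)$")),
    ("pick_up", re.compile(
        r"^pick up the (?:(?P<color>red|green|blue) )?(?P<object>\w+)$")),
    ("put", re.compile(
        r"^put (?P<object_ref>this|that|the) (?P<object>\w+) "
        r"(?P<location_ref>here|there)$")),
]


def parse_utterance(text):
    """Return (task, slots) for the first matching pattern, or None."""
    # Lowercase and collapse whitespace so matching is case-insensitive.
    normalized = " ".join(text.lower().split())
    for task, pattern in PATTERNS:
        match = pattern.match(normalized)
        if match:
            # Keep only the slots that were actually present in the utterance.
            slots = {k: v for k, v in match.groupdict().items() if v is not None}
            return task, slots
    return None
```

With this sketch, `parse_utterance("Put this glass here")` yields the task `put` with a deictic object reference (`this`) and a deictic location reference (`here`), which is exactly the kind of under-specified command that needs gesture information to be completed.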

When explicitly provided, information about actions, objects and their attributes (such as color or size), as well as locations and robot configuration parameters (speed, rotation, distance), is extracted. These speech interpretation results are used to build a command, which is then validated against the interpretation models of the application domain. Cues obtained from visual tracking (3D face, gaze, hands) are used to perform HMM-based gesture recognition. Fusion is tackled from a hierarchical, rule-based point of view. A late fusion process is applied according to the speech interpretation result, when a gesture interpretation output is required and must be integrated to fill in missing parameters or improve confidence scores. The command is then sent to the robot supervisor. The user’s multimodal requests are thus processed according to a speech-driven approach.
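The speech-driven late fusion described above can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the fusion module integrated on the robots: the class names, the `min_gesture_conf` threshold and the product rule used to combine confidence scores are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SpeechCommand:
    action: str
    slots: dict        # e.g. {"object": "glass", "location": None}
    confidence: float
    # Names of the slots that were expressed with a deictic ("this", "here").
    deictic_slots: list = field(default_factory=list)


@dataclass
class GestureHypothesis:
    kind: str          # e.g. "pointing"
    target: str        # symbolic object or location resolved from the gesture
    confidence: float


def fuse(speech, gesture, min_gesture_conf=0.5):
    """Speech-driven late fusion (hypothetical rule set).

    Returns (slots, confidence), or None if the command cannot be completed.
    The gesture is consulted only when speech leaves a deictic slot unresolved.
    """
    missing = [name for name in speech.deictic_slots
               if speech.slots.get(name) is None]
    if not missing:
        return dict(speech.slots), speech.confidence   # speech is sufficient
    if (gesture is None or gesture.kind != "pointing"
            or gesture.confidence < min_gesture_conf):
        return None                                    # nothing to fill with
    filled = dict(speech.slots)
    for name in missing:
        filled[name] = gesture.target
    # Assumed combination rule: product of the two confidence scores.
    return filled, speech.confidence * gesture.confidence
```

For “Put this glass here”, a speech result with an empty `location` slot and a pointing gesture resolved to a hypothetical `"table"` target would produce a complete command, whereas the same utterance without any usable gesture would be rejected instead of being sent to the supervisor.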

For more details about the robotics applications, see: RECO module on Rackham, RECO module on Jido, GEST module on Jido.

Current work on multimodal perception and awareness (Christophe Mollaret’s PhD 2012-2015)

A robot designed to assist a person in his daily life, and able to offer a set of services that meet the needs or expectations of this person, has to cope with different situations, starting with the environment and its static and dynamic “components”: static because of the rooms and furniture, dynamic because of the person’s activities and their impact on surrounding objects. Multimodal perception of humans, objects and activities is a challenging issue, as is decoding the user’s intentions and interacting with him at the right moment. To address this issue, we have chosen to focus our research first on detecting and measuring the user’s intentionality, by making the robot aware of the user’s expectations in terms of interaction. Observing the current situation, monitoring the user’s main activities (answering phone calls, watching TV, cooking, …) and detecting when the user intends to interact with the robot is a first step towards the robot’s awareness of its environment.

Our first approach is based on visual tracking using RGB-D information, focusing on the distance between the user and the robot and on the orientation of the user’s shoulders and face. Our working hypothesis is that the user, when ready to interact with the robot, will act naturally and face it to start the interaction.
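As a rough illustration of this hypothesis, the three cues (distance, shoulder orientation, face orientation) could be combined into a single score. The sketch below is an assumption made for illustration only: the linear scoring functions, the thresholds (`near_m`, `far_m`, `max_angle_deg`) and the product rule are hypothetical and are not taken from the work described here.

```python
def facing_score(angle_deg, max_angle_deg=45.0):
    """1.0 when facing the robot (0 degrees), falling linearly to 0.0."""
    return max(0.0, 1.0 - abs(angle_deg) / max_angle_deg)


def proximity_score(distance_m, near_m=1.0, far_m=4.0):
    """1.0 within near_m, falling linearly to 0.0 at far_m and beyond."""
    if distance_m <= near_m:
        return 1.0
    if distance_m >= far_m:
        return 0.0
    return (far_m - distance_m) / (far_m - near_m)


def intention_score(distance_m, shoulder_yaw_deg, face_yaw_deg):
    """Combine the three cues into a score in [0, 1].

    Using a product means that any single cue at zero vetoes the others,
    e.g. a user standing close but turned away from the robot.
    """
    return (proximity_score(distance_m)
            * facing_score(shoulder_yaw_deg)
            * facing_score(face_yaw_deg))
```

A user standing 1 m away and facing the robot gets the maximum score, while the same user with shoulders turned 90 degrees away gets zero, whatever the distance.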

The first part of our work has been to build a tracker based on an evolutionary optimization approach, the PSO (Particle Swarm Optimization) algorithm, and to propose an extension in which the system dynamics are explicitly taken into account and which performs efficient tracking. This tracker is also shown to outperform several algorithms: the SIR (Sampling Importance Resampling) algorithm with random walk and constant velocity models, and a previous PSO-inspired tracker, SPSO (Sequential Particle Swarm Optimization). It has been applied to simulated and real data, as illustrated in Figure 2.
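For readers unfamiliar with PSO, the sketch below shows textbook particle swarm optimization used frame by frame. It is not the extended tracker described above, whose details are not given here: the way dynamics enter (seeding each swarm at a constant-velocity prediction), the parameter values and the function names are assumptions made for illustration.

```python
import random


def pso_minimize(cost, center, spread=1.0, n_particles=30, n_iters=50,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Textbook PSO: minimize cost(x), with the swarm seeded around center."""
    rng = random.Random(seed)
    dim = len(center)
    pos = [[c + rng.uniform(-spread, spread) for c in center]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # personal best positions
    best_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=best_cost.__getitem__)
    g_pos, g_cost = best_pos[g][:], best_cost[g]    # global best
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull towards personal best + pull towards global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            c = cost(pos[i])
            if c < best_cost[i]:
                best_pos[i], best_cost[i] = pos[i][:], c
                if c < g_cost:
                    g_pos, g_cost = pos[i][:], c
    return g_pos, g_cost


def track(frames, make_cost, start):
    """Run one PSO search per frame.

    Assumption: each swarm is seeded at a constant-velocity prediction
    computed from the two previous estimates.
    """
    estimates = []
    prev, prev2 = list(start), list(start)
    for frame in frames:
        predicted = [2 * a - b for a, b in zip(prev, prev2)]
        estimate, _ = pso_minimize(make_cost(frame), predicted)
        estimates.append(estimate)
        prev2, prev = prev, estimate
    return estimates
```

Here `make_cost(frame)` is a hypothetical callback that returns the cost function for one frame, for instance a mismatch between a candidate pose and the RGB-D observation. In a real tracker this cost evaluation would dominate the computation time.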

Figure 2: RGB-D sequence capture: red spot (shoulder detection), green spot (head pose detection)

Our current research and collaborative work (with the RAP team) now focuses on the fusion of visual tracking results with audio and speech activity, in order to detect the user’s intentionality and to start a proximal interaction with him. This interaction depends on the current context, taking into account the user’s environment, the locations of target objects, the user’s previous activities and the user’s current request. In the context of the RIDDLE project, the main topic of interaction will be centered on a subset of everyday objects such as keys, glasses, remote controls, mobile phones, … The whole process to be set up is depicted below in Figure 3.

  • Monitoring the user in his daily life environment
  • Detecting user intentionality based on video and audio data
  • Moving towards the user to start proximal interaction
  • Context-guided multimodal dialog (speech & gesture)

Figure 3: Multimodal perception for human-robot interaction in the scope of the RIDDLE project.
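The four stages listed above can be read as a simple state machine. The sketch below is one possible reading, not the architecture of the RIDDLE system: the stage names, the event names and the fallback transition (`user_left`) are hypothetical. Detecting intentionality is modelled as the event that ends the monitoring stage.

```python
from enum import Enum, auto


class Stage(Enum):
    MONITORING = auto()    # observe the user in the home environment
    APPROACHING = auto()   # move towards the user
    DIALOG = auto()        # context-guided multimodal dialog


# Hypothetical event names; intention detection triggers the approach.
TRANSITIONS = {
    (Stage.MONITORING, "intention_detected"): Stage.APPROACHING,
    (Stage.APPROACHING, "user_reached"): Stage.DIALOG,
    (Stage.APPROACHING, "user_left"): Stage.MONITORING,
    (Stage.DIALOG, "dialog_finished"): Stage.MONITORING,
}


def step(stage, event):
    """Return the next stage; unknown events leave the stage unchanged."""
    return TRANSITIONS.get((stage, event), stage)


def run(events, stage=Stage.MONITORING):
    """Feed a sequence of events and return the list of visited stages."""
    visited = [stage]
    for event in events:
        stage = step(stage, event)
        visited.append(stage)
    return visited
```

Keeping the transitions in a table makes it easy to see that the robot only starts a dialog after an intention has been detected and the user has been reached.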


The current research work on multimodal human perception for human-robot interaction is carried out within the ANR RIDDLE project.

Part of the previous work on multimodal command for human-robot interaction was a contribution to the following European (LAAS) projects:

  • Cogniron (“The Cognitive Robot Companion” : 2004-2008)
  • CommRob (“Advanced Robot behaviour and high-level multimodal communication” : 2007-2009)


Main publications

Brice Burger, Isabelle Ferrané, Frédéric Lerasle. Multimodal Interaction Abilities for a Robot Companion. In: Int. Conf. on Computer Vision Systems (ICVS 2008), p. 549-558, Santorini (Greece), May 2008.

Brice Burger, Frédéric Lerasle, Isabelle Ferrané, Aurelie Clodic. Mutual Assistance between Speech and Vision for Human-Robot Interaction. In: IEEE/RSJ International Conference on Intelligent RObots and Systems (IROS 2008), Nice, France, 22/09/2008-26/09/2008, IEEE, p. 4011-4016, 2008.

Brice Burger, Isabelle Ferrané, Frédéric Lerasle, Guillaume Infantes. Two-handed Gesture Recognition and Fusion with Speech to command a Robot. In: Autonomous Robots, Springer, Vol. AURO655.3, (online), 2012.

Christophe Mollaret, Frédéric Lerasle, Isabelle Ferrané, Julien Pinquier. A Particle Swarm Optimization inspired tracker applied to visual tracking. ICIP 2014, Paris, October 2014, to be published.
