Open source smart assistant dedicated to companies and professional usage
Main issues and objectives
As part of its strategy to develop innovative open source tools in a collaborative context, LINAGORA, leader of the LinTO project, aims to design a smart conversational assistant dedicated to companies. This assistant, named after the project, provides speech-driven services to company staff members.
On the one hand, personal services allow users to manage their calendar, book a meeting room or control room devices (shutters, lights, video projector). On the other hand, LinTO has been designed for a more collective purpose, namely assisting users in managing meeting minutes.
Two companies, LINAGORA and ZELROS, as well as three academic labs, IRIT, LAAS and LIX, contributed to this project.
During this 3-year project, our contribution to LinTO was twofold:
Improvement of automatic speech recognition (ASR) in meeting context
Abdel Heba, industrial PhD student, was supervised by Thomas Pellegrini and Régine André-Obrecht from SAMoVA and Jean-Pierre Lorré from LINAGORA. His work, entitled "Large Vocabulary Automatic Speech Recognition: from hybrid to End-to-End approaches", has been a great contribution to the ASR improvement issue [Heba, 2021], [Heba et al., 2019].
Enrichment of automatic speech transcripts with non-verbal information
Our team was mainly involved in the process of analyzing conversational and spontaneous speech in meeting contexts. The idea was to extract non-verbal information from both the audio and visual modalities to characterize the conversational interaction between participants.
This work was done in collaboration with researchers from LAAS. They brought their expertise on image and video processing to complement our own expertise on audio and speech processing. After defining use-case scenarios, several issues were studied:
(1) equipping LinTO with physical devices
to capture audio (microphone array) and video (360° panoramic camera) streams during multi-party meetings;
(2) analyzing audio streams
to segment each meeting into macro-segments corresponding to successive interaction sequences. This work was based on "non-speech" segments, more specifically "silences" (pauses): their detection, characterization and classification [Pibre et al., 2021];
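Pause detection of this kind is commonly bootstrapped from a simple energy criterion. The sketch below is a minimal, hypothetical illustration of that idea (frame length, threshold and minimum duration are illustrative, not the project's actual parameters): low-energy frames are grouped into runs, and sufficiently long runs are reported as pause candidates.

```python
# Hypothetical energy-based pause detection sketch (not the project's
# actual implementation): frames whose mean energy falls below a
# threshold for long enough are reported as pauses.

def frame_energies(samples, frame_len=160):
    """Mean squared energy per non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def detect_pauses(energies, threshold=0.01, min_frames=3):
    """Return (start_frame, end_frame) spans of consecutive low-energy frames."""
    pauses, start = [], None
    for i, e in enumerate(energies + [float("inf")]):  # sentinel flushes the last run
        if e < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_frames:
                pauses.append((start, i))
            start = None
    return pauses
```

In a real system the threshold would be adapted to the recording conditions, and the resulting pauses would then be characterized (duration, position relative to speaker turns) before classification.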
(3) analyzing and merging data from different modalities (sound, images).
The multi-speaker, multi-target context of meetings is complex, given the variability of the number of participants and of the person-to-sensor distances. Our main goal was to detect, among the meeting participants and on the basis of these two modalities, who is currently the active speaker, combining face detection, gaze orientation, optical flow and diarization results. Preliminary studies were done on the AMI corpus [Madrigal et al., 2020];
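One common way to combine such heterogeneous cues is late fusion: each modality produces a per-participant score, and a weighted sum decides who is speaking. The snippet below is only a schematic sketch of that principle; the cue names and weights are assumptions for illustration, not the scores or fusion rule actually used in the project.

```python
# Schematic late-fusion sketch for active-speaker detection.
# Cue names ("diarization", "lip_motion", "gaze_at") and weights are
# hypothetical; real systems learn or tune these from data.

def active_speaker(cues, weights):
    """cues: {participant: {cue_name: score in [0, 1]}}.
    Returns the participant with the highest weighted cue sum."""
    def fused(p):
        return sum(weights[c] * cues[p].get(c, 0.0) for c in weights)
    return max(cues, key=fused)
```

For example, a participant with strong diarization and lip-motion evidence would outscore one who is merely being looked at by the others.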
(4) characterizing each participant by its audio-visual signature.
Such a representation enables the system to be independent of any external model that could identify each participant as a specific person (for privacy reasons). This approach is based on the computation, comparison and combination of representations derived from each participant's first intervention (during the round-table sequence of the current meeting), and was applied to the LinTO corpus [Pibre et al. 2021b, submitted to MTAP];
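Matching a new observation against such signatures typically reduces to comparing embedding vectors, e.g. by cosine similarity against the reference built from each participant's first intervention. The following sketch assumes embeddings are plain float vectors; the embedding extraction itself (audio-visual, in the project) is out of scope here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_participant(signature, references):
    """references: {anonymous label: embedding from first intervention}.
    Returns the label whose reference is closest to the query signature."""
    return max(references, key=lambda lab: cosine(signature, references[lab]))
```

Note that the labels are anonymous within the meeting ("participant 1", "participant 2", ...), which is precisely what keeps the approach independent of any person-identification model.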
(5) integrating information from the acoustic analysis to enrich automatic transcripts
Automatic segmentation of meeting transcripts into dialog acts benefits from indicators produced by the audio stream analysis. Indeed, information such as speaker changes, intonation variations (rising, falling or stable) and pauses, which may provide useful cues about utterance punctuation, improves this high-level segmentation step. This work was done in collaboration with the IRIT-MELODI team, who led the conversation analysis task of the LinTO project [Gravelier et al., submitted to EMNLP 2021].
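As a toy illustration of how acoustic cues can feed such a segmentation step, one can imagine each inter-word gap voting with its available cues (speaker change, long pause, falling intonation) and keeping the gaps with enough votes as boundary candidates. The cue names and the voting rule below are hypothetical, chosen only to make the idea concrete.

```python
# Toy sketch: acoustic cues vote for dialog-act boundary candidates.
# Cue names and the voting threshold are illustrative assumptions,
# not the LinTO project's actual model.

def boundary_candidates(gaps, threshold=2):
    """gaps: list of dicts of boolean cues, one per inter-word gap.
    Returns the indices of gaps with at least `threshold` active cues."""
    cues = ("speaker_change", "long_pause", "falling_intonation")
    return [i for i, g in enumerate(gaps)
            if sum(g.get(c, False) for c in cues) >= threshold]
```

In practice such indicators would be weighted features of a trained segmentation model rather than hard votes.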
Interface for result visualization
To highlight the work carried out during this project and to better show the kind of results obtained, an interface dedicated to result visualization has been developed.
As shown on the snapshot above, on the left you can select the corpus used (LinTO, AMI, ...) and the type of information to display. Both streams, visual (top) and audio segments (bottom), related to the participants are shown. Links between participants' speech turns and their spatial location in the video are represented by the same color. This interface can be found here (downloading annotations can take some time).
A video with commentary in French is also available to explain how the interface works and the types of results presented. It also includes results from the other partners, LAAS, LINAGORA and IRIT/MELODI, concerning respectively action detection, hand-gesture detection when participants are voting or asking for the floor, and automatic segmentation into dialog acts.
People involved in the SAMOVA team
- Isabelle Ferrané (IRIT scientific coordinator) – Thomas Pellegrini – Julien Pinquier
- with the contribution of: Yassir Bouiry (short-term contract, STC), Sélim Mechrouh (intern), Cyrille Equoy (STC), Lila Gravelier (STC) and Gautier Arcin (STC).
- Programme d’Investissements d’Avenir – GRANDS DEFIS DU NUMERIQUE – 2018
- Funded by BPI France
- From 1st April 2018 to 31st March 2021