Comprehensibility of audiovisual content

Iferrane / July 13, 2023 / Applications


The teaching of foreign languages often relies on audiovisual content to familiarize learners with native speech production and to improve their oral communication skills. However, selecting and preparing this content for a didactic purpose can be a weighty task, which limits the number of documents that foreign language teachers can exploit.

In order to provide teachers with tools that help them select movie sequences suited to their teaching criteria, such as topic, interaction context, linguistic level or potential exercises, we focused on defining an objective measure that may reflect the level of comprehensibility of the targeted audiovisual content.

This research work, carried out as an industrial PhD, was based on the collaboration of the SAMoVA team from IRIT and a French company (Archean Labs) through the joint laboratory ALAIA (funded by the French Research Agency) on Foreign Language learning assisted by Artificial Intelligence. The PhD defense took place in October 2022.


First, we looked at the phenomena known in the literature to influence the comprehensibility of a document depending on its modality (text, audio, or audiovisual). Two complementary points of view were considered: foreign language didactics and automatic processing. As far as we knew at the beginning of this research work, there were no resources that would allow us to study the comprehensibility of audiovisual content and define an accurate measure of its level. Our study focused on French as a foreign language.

Comprehensibility as defined in our research work

In this context, comprehensibility can be defined as the capacity of an audiovisual document to be understood by learners according to their level in the target language. The level of comprehensibility of audiovisual content, which can vary from easy to difficult, can be analyzed by considering each document as a whole, or from the point of view of a single modality.

The ESCAL corpus

Our first contribution was to build a corpus of short extracts from different French movies (audiovisual content) and to collect teachers’ annotations in terms of comprehensibility levels. Each extract was manually selected for its content in terms of communication situation between characters (from monologues to multi-speaker conversations) and more specifically in terms of interaction objective, spatial and temporal contexts, as well as participants (number, role, …), as defined in 1999 by Traverso [1]. Three modalities were considered:

  • (T) text, as the exact transcript of what is said by each character;
  • (A) audio, as the corresponding soundtrack;
  • (V) image, as the corresponding image sequence.

as well as their combinations: (AT), (AV) and (AVT).

This corpus, named ESCAL for Etude subjective de la compréhensibilité pour l’apprentissage des langues (Subjective study of comprehensibility in a language learning context), gathers 55 extracts from 15 different French movies (representing 43 minutes of audiovisual content). It was presented to a panel of 15 experts, all teachers of French as a Foreign Language, who were in charge of evaluating the comprehensibility level of each extract under one of the above 5 modalities (T, A, V, AT, AV or AVT). Four dimensions were considered for these evaluations: vocabulary, grammar, intelligibility (when audio was included) and the overall level of the document. Evaluations were given on a scale from 0 to 100 (from very easy to very difficult). Following the protocol described in detail in [2], each of the 275 documents (one extract along with one of the five modalities available) was evaluated by 3 experts. A first analysis of the factors influencing comprehensibility and a detailed analysis of these subjective evaluation results were presented in [3] and [4], respectively.
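The counts above fit together arithmetically: 55 extracts under 5 modality conditions give 275 documents, and 3 experts per document give 825 evaluations, or 55 per expert if the load is spread evenly over the 15 experts. The Python sketch below builds one rotating assignment that is consistent with these counts. It is only an illustration: the rotation scheme, the function name `assign` and the placeholder labels `m1` to `m5` are assumptions, and the actual protocol is the one described in [2].

```python
from collections import Counter

N_EXTRACTS = 55
N_EXPERTS = 15
# Placeholder labels for the five modality conditions (hypothetical names).
MODALITIES = ["m1", "m2", "m3", "m4", "m5"]


def assign(n_extracts, n_experts, modalities):
    """Map each (extract, modality) document to the experts who rate it.

    Rotating scheme (an assumption, not the published protocol): expert k
    sees extract e under modality (k + e) mod 5, so every expert sees each
    extract exactly once.
    """
    plan = {}
    for expert in range(n_experts):
        for extract in range(n_extracts):
            modality = modalities[(expert + extract) % len(modalities)]
            plan.setdefault((extract, modality), []).append(expert)
    return plan


plan = assign(N_EXTRACTS, N_EXPERTS, MODALITIES)
n_documents = len(plan)
n_evaluations = sum(len(experts) for experts in plan.values())
raters_per_document = Counter(len(experts) for experts in plan.values())

print(n_documents, n_evaluations, dict(raters_per_document))
```

With this scheme no expert rates the same extract twice, which is one plausible way to avoid an expert becoming familiar with an extract before rating it under another modality.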

Automatic approaches for measuring comprehensibility

ESCAL being our ground truth, two approaches for predicting the comprehensibility level of a new document were investigated:

– A classical one, called “interpretable”, based on parameter extraction and selection followed by a supervised regression step (multiple linear regression). The underlying motivation was the ability to describe, explain and make explicit the relationship between the predictions obtained for each dimension mentioned and the parameters kept. The idea was to allow teachers to understand exactly what impacts the predicted level of comprehensibility.

–  A second approach, called “neural”, more global and less interpretable, based on existing neural models and the representations they can provide for each modality or after modality fusion. Several fusion strategies were investigated. The motivation here was to see whether models derived from more recent technologies (deep learning) are more effective than classical ones.

Comparison and results

The models from both approaches were compared using the Pearson correlation (r) and the root mean square error (RMSE). Results were obtained with the Leave-One-Out cross-validation method applied to each dimension. Through this comparative study, we were able to see that, from a quantitative point of view, multiple linear regressions provide the best predictions.
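The evaluation scheme described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the two features per document and all numeric values are invented, and the helper names (`fit_ols`, `loo_predictions`, `pearson`, `rmse`) are hypothetical. Each document is held out in turn, a multiple linear regression is fitted on the remaining ones, and the held-out predictions are compared with the reference scores using r and RMSE.

```python
import math


def fit_ols(X, y):
    """Least-squares weights [intercept, w1, ...] from the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for k in range(n):
            if k != col:
                f = A[k][col]
                A[k] = [a - f * b for a, b in zip(A[k], A[col])]
    return [A[i][n] for i in range(n)]


def predict(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def loo_predictions(X, y):
    """Leave-One-Out: predict each item with a model fitted on all the others."""
    preds = []
    for i in range(len(X)):
        w = fit_ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(predict(w, X[i]))
    return preds


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def rmse(pred, ref):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, ref)) / len(ref))


# Invented toy data: two hypothetical features per document and a 0-100 score.
features = [[3.0, 0.10], [4.5, 0.25], [5.0, 0.15],
            [6.2, 0.40], [3.8, 0.30], [5.5, 0.35]]
scores = [20.0, 45.0, 38.0, 70.0, 42.0, 61.0]

preds = loo_predictions(features, scores)
print(f"r = {pearson(preds, scores):.2f}, RMSE = {rmse(preds, scores):.2f}")
```

Because every prediction comes from a model that never saw the corresponding document, both metrics estimate performance on unseen data, which matters for a corpus of this size. The fitted weights returned by `fit_ols` can also be read directly, which is the kind of transparency the interpretable approach aims for.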

To conclude this work, the best interpretable model (r = 0.38 and RMSE = 17.88) was applied to a test corpus of 10 new movie extracts, which were also evaluated by a panel of 22 new expert users. The idea was to compare their perception of the level of comprehensibility with the predictions provided by our model. The Spearman correlation (0.68 with a p-value < 0.05) shows that our system behaves similarly to one of the two clusters of human annotators (obtained after an automatic clustering step) when it comes to classifying audiovisual documents in terms of comprehensibility level.
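For readers unfamiliar with it, the Spearman correlation compares two orderings rather than two sets of raw values: each series is replaced by its ranks, and the Pearson correlation is computed on those ranks. The sketch below is a minimal Python illustration under that standard definition. The scores are invented, the function names are hypothetical, and the p-value is not computed.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


# Invented scores for 10 extracts: model predictions versus the mean rating
# of one group of annotators (both on the 0-100 scale).
model_scores = [32.0, 41.5, 55.0, 47.0, 60.5, 38.0, 71.0, 52.5, 44.0, 66.0]
panel_scores = [25.0, 50.0, 48.0, 40.0, 75.0, 30.0, 68.0, 62.0, 35.0, 58.0]

print(f"Spearman correlation = {spearman(model_scores, panel_scores):.2f}")
```

A rank-based measure suits this comparison because the question is whether the model orders the extracts from easy to difficult in the same way as the annotators, not whether it reproduces their exact scores.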


Estelle Randria (PhD); IRIT supervisors: Isabelle Ferrané, Julien Pinquier (HDR);

Archean Labs supervisor: Lionel Fontan (Archean Labs).

References and related publications

  • [1] Véronique Traverso. L’analyse des conversations. Paris: Nathan, 1999.
  • [2] Estelle Randria. Compréhensibilité de contenus audiovisuels : quelles approches pour une mesure objective ? Informatique [cs]. PhD thesis, Université Paul Sabatier (Toulouse 3), 2022. In French. ⟨NNT : 2022TOU30258⟩
  • [3] Estelle I. S. Randria, Lionel Fontan, Maxime Le Coz, Isabelle Ferrané, Julien Pinquier. Subjective Evaluation of Comprehensibility in Movie Interactions. 12th Conference on Language Resources and Evaluation (LREC 2020), European Language Resources Association (ELRA), May 2020, Marseille, France, pp. 2348-2357.
  • [4] Estelle Randria, Lionel Fontan, Maxime Le Coz, Isabelle Ferrané, Julien Pinquier. Étude des facteurs affectant la compréhensibilité de documents multimodaux : une étude expérimentale. 6e conférence conjointe Journées d’Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 1 : Journées d’Études sur la Parole, 2020, Nancy, France, pp. 534-54.