Click on the images to see the details of each research topic.
Some links may be down; we apologise for that.
The team has substantial know-how and expertise in low-level segmentation.
In audio, most of our work relies on the forward-backward segmentation algorithm. A robust version (resilient to adverse environments, language- and speaker-independent) locates the relevant information, which can then be extracted and used in various domains:
- In automatic language identification: from the identification of vocalic segments, a new prosodic unit, called the pseudo-syllable, has been defined to characterize rhythm and intonation. Prosody can thus be modeled and introduced into an automatic language identification system, complementing the acoustic and phonetic modeling.
- In automatic speaker verification: the automatic segmentation provides the transient zones, which carry speaker-specific information.
- In speech/music detection: the segmentation process behaves quite differently on speech and on music, so modeling the segment distribution makes the speech/music discrimination more robust.
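The last point can be sketched in a minimal way. One plausible reading of "segment distribution" is the distribution of segment durations; the sketch below is not the team's actual system, it simply scores durations from a prior segmentation against one Gaussian duration model per class, with all model values purely illustrative.

```python
import math

# Illustrative per-class models of segment duration (seconds), assumed to
# have been fitted offline; the real system's models are not published here.
CLASS_MODELS = {
    "speech": (0.08, 0.03),   # (mean, std) of segment durations
    "music":  (0.25, 0.10),
}

def log_likelihood(durations, mean, std):
    """Sum of Gaussian log-densities of the observed segment durations."""
    return sum(
        -0.5 * math.log(2 * math.pi * std ** 2)
        - (d - mean) ** 2 / (2 * std ** 2)
        for d in durations
    )

def classify_segments(durations):
    """Return the class whose duration model best explains the segments."""
    return max(
        CLASS_MODELS,
        key=lambda c: log_likelihood(durations, *CLASS_MODELS[c]),
    )

print(classify_segments([0.07, 0.09, 0.06, 0.10]))  # short segments: "speech"
```

In practice the class models would be estimated from labelled data; they are hard-coded here only to keep the example self-contained.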
In video, most analyses start from a preliminary segmentation into shots, obtained by hard-cut detection and dissolve localization. Extensions of this tool also make it possible to analyze compositing effects (overlay detection, split-screen localization, and so on). In some cases, a spatio-temporal representation of the content, called the “X-ray” image, is computed to obtain a micro-segmentation into homogeneous camera movements.
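Hard-cut detection of the kind mentioned above is classically implemented by thresholding a frame-to-frame histogram distance. A minimal sketch, assuming grey-level frames as NumPy arrays; the bin count and threshold are illustrative, not the team's settings:

```python
import numpy as np

def grey_histogram(frame, bins=64):
    """Normalised grey-level histogram of one frame (2-D uint8 array)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def detect_hard_cuts(frames, threshold=0.5):
    """Flag frame indices where the L1 histogram distance to the
    previous frame exceeds the threshold (a classical criterion)."""
    cuts = []
    prev = grey_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = grey_histogram(frame)
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts

# Toy example: 3 dark frames, then 3 bright frames -> one cut at index 3.
dark = np.zeros((8, 8), dtype=np.uint8)
bright = np.full((8, 8), 200, dtype=np.uint8)
print(detect_hard_cuts([dark] * 3 + [bright] * 3))  # [3]
```

Dissolves, which spread the histogram change over many frames, need a different criterion and are not covered by this sketch.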
- Dedicated Features for Music Genre Classification 21 January 2020
- Characterizing Pathological Voices 7 January 2016
- Segmentation in singer turns 7 January 2014
- Deformable / non-deformable object analysis 7 January 2014
- Spectral Cover 7 January 2013
- Multiple sources detection 7 January 2013
- Unison Choir Detection 7 January 2012
- Rhythm estimation 7 January 2010
- Monophony / Polyphony Distinction 7 January 2009
- Generic GLR/BIC Audio-Video Segmentation 7 January 2009
The classification and data-mining methods we study derive from both generative and discriminative approaches, generally in a supervised-learning setting. Although Hidden Markov Models (HMM) remain the main framework in which the team develops its own models and classifiers, several new approaches have been investigated:
- To exploit parameter correlations, Dynamic Bayesian Networks have been studied; they provide better confidence and robustness than HMMs.
- A new model derived from SVMs has been proposed to handle multiple databases and to process observation-vector sequences of variable length. A new kernel between pairs of sequences has been studied theoretically in an SVM framework and implemented for automatic speaker verification.
- A multi-level human model has been proposed to analyze human motion without prior knowledge about the video source. The model is decomposed into three hierarchical levels, each corresponding to a resolution level. Current developments concern the hierarchical decomposition and the matching process across the model levels, handled so as to deal with spatial and temporal constraints and to take into account dynamically invariant aspects of human motion.
- As the information sources are very often multiple (within a medium or across media), the integration method becomes a strategic key. Multimodal integration methods are usually classified as decision (late) fusion or early fusion. To go beyond the classical combination of weighted scores or the plain concatenation of observation vectors, several strategies have been studied within a formal methodology: they rely on confidence indices (for classes, experts and observations) and on probabilistic and uncertainty theories. First experiments have been performed on automatic language identification, where acoustic, phonotactic and prosodic information are merged.
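The decision (late) fusion mentioned in the last point can be sketched as a confidence-weighted combination of per-class expert scores. The experts, scores and weights below are illustrative and do not reflect the team's actual confidence indices:

```python
def late_fusion(expert_scores, confidences):
    """Combine per-class scores from several experts by a
    confidence-weighted sum, then pick the best class.

    expert_scores: {expert: {class: score}}, confidences: {expert: weight}.
    """
    classes = next(iter(expert_scores.values())).keys()
    fused = {
        c: sum(confidences[e] * scores[c] for e, scores in expert_scores.items())
        for c in classes
    }
    return max(fused, key=fused.get), fused

# Hypothetical acoustic / phonotactic / prosodic experts for language ID.
scores = {
    "acoustic":    {"fr": 0.6, "en": 0.4},
    "phonotactic": {"fr": 0.3, "en": 0.7},
    "prosodic":    {"fr": 0.7, "en": 0.3},
}
weights = {"acoustic": 0.5, "phonotactic": 0.3, "prosodic": 0.2}
best, fused = late_fusion(scores, weights)
print(best)  # fr
```

Early fusion would instead concatenate the observation vectors before a single classifier; the strategies studied by the team go beyond both of these baselines.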
- Automated Audio Captioning 7 March 2023
- Acoustic-to-articulatory Inversion 7 January 2009
- Prosody Modelling 7 January 2005
- Differentiated Modeling 7 January 2002
The main goal of our group on this topic is to define tools able to perform structure analysis on the audio and video tracks simultaneously. To do so, two main approaches are currently explored:
- The first one aims at highlighting temporal relationships between two types of events. An event is to be understood here as a “segment in which a given type of content can be observed”, such as a given face, a given speaker, music, a graphical icon, etc. The temporal relation between two segments can be characterized by three numerical parameters, so each pair of segments produced by two segmentation processes can be associated with one point in a 3D space. We then accumulate a vote for every temporal relation found between the automatically identified segments. The vote distribution in the 3D matrix can then be used to identify different pieces of information, such as:
- The association between a voice and a face (corresponding to a same person),
- The players belonging to a same team in a TV game,
- The entertainer identification in a radio program,
- Etc.
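The 3D vote described above can be sketched as follows. The page does not name the three numerical parameters, so this sketch assumes normalised start offset, end offset and overlap ratio; the binning is likewise an assumption:

```python
import numpy as np

def relation_params(a, b):
    """Three illustrative parameters describing the temporal relation
    between segments a=(start, end) and b=(start, end): normalised
    start offset, end offset and overlap ratio (assumed, not the
    team's actual parameterisation)."""
    (a0, a1), (b0, b1) = a, b
    span = max(a1, b1) - min(a0, b0)
    overlap = max(0.0, min(a1, b1) - max(a0, b0))
    return ((b0 - a0) / span, (b1 - a1) / span, overlap / span)

def vote_matrix(segs_a, segs_b, bins=8):
    """Accumulate one vote per segment pair in a bins^3 histogram.
    Each parameter lies in [-1, 1] and is quantised into `bins` cells."""
    votes = np.zeros((bins, bins, bins))
    for a in segs_a:
        for b in segs_b:
            p = relation_params(a, b)
            idx = tuple(min(int((v + 1) / 2 * bins), bins - 1) for v in p)
            votes[idx] += 1
    return votes

# Two streams whose segments systematically co-occur (e.g. a face and
# the voice of the same person) concentrate the votes in one cell.
faces = [(0, 10), (20, 30), (40, 50)]
voices = [(1, 10), (21, 30), (41, 50)]
m = vote_matrix(faces, voices)
print(m.max())  # 3.0: the three aligned pairs share one cell
```

Peaks in the 3D matrix then reveal recurrent temporal relations, such as a voice systematically overlapping a face.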
- The second one aims at establishing a style-similarity measure between two recordings. The input of this method is any low-level feature that associates a series of numerical values with an audiovisual content along the temporal dimension. Local best matches between the values extracted from two different contents are then computed with an optimized algorithm, and the quality score of each match is accumulated in a 2D matrix. Here again, the distribution of the highest coefficients in this matrix characterizes two types of similarity:
- Diagonal distributions correspond to strict similarity. This is observed when some subparts of the compared documents are exactly the same.
- Distributions in “blocks” correspond to a local style similarity: the content of the corresponding segments is not exactly the same, but shares some common properties.
From this “similarity matrix”, a global similarity measure can then be extracted and used, for example, for a document-classification task.
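A minimal sketch of the similarity-matrix idea, using a toy one-dimensional feature and negative absolute difference as the local match quality (both assumptions, not the team's optimized algorithm); a long diagonal run of high coefficients signals strict similarity:

```python
import numpy as np

def similarity_matrix(feat_a, feat_b):
    """2-D matrix of local match qualities between two 1-D feature
    series (negative absolute difference: higher means better)."""
    a = np.asarray(feat_a, dtype=float)
    b = np.asarray(feat_b, dtype=float)
    return -np.abs(a[:, None] - b[None, :])

def diagonal_score(sim, threshold=-0.1):
    """Length of the longest diagonal run of high coefficients,
    a crude indicator of 'strict' similarity."""
    high = sim >= threshold
    best = 0
    for offset in range(-sim.shape[0] + 1, sim.shape[1]):
        diag = np.diagonal(high, offset=offset)
        run = cur = 0
        for v in diag:
            cur = cur + 1 if v else 0
            run = max(run, cur)
        best = max(best, run)
    return best

# A shared subsequence [2, 3, 4] produces a diagonal run of length 3.
m = similarity_matrix([0, 2, 3, 4, 9], [7, 2, 3, 4, 1])
print(diagonal_score(m))  # 3
```

Block-shaped distributions would instead be detected by looking for dense rectangular regions of high coefficients rather than diagonal runs.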
- Audiovisual signature 7 January 2016
- Multimodal Spatio-temporal clustering 7 January 2012
- Interaction and Speaker Role Detection 7 January 2010
- Adaptive User-Defined Similarity Measure 7 January 2008
- Temporal Relation Matrix 7 January 2007
Here are some applications of our research.
- Comprehensibility of audiovisual content 13 July 2023
- Mobile application for automatic intelligibility measurement 9 June 2023
- Audio Fingerprinting for TV channel detection in real time 20 February 2021
- Clinical relevance of speech intelligibility measures 22 September 2020
- Functional impact of speech disorders 21 September 2020
- Reading mistakes detection in children’s speech 15 January 2020
- Water sound detection 7 January 2017
- Multimodal Human Robot Interaction 7 January 2014
- Singing Voice Detection 7 January 2009
- High Level Feature Extraction 7 January 2009