Research - SAMoVA

The team has substantial know-how and expertise in low-level segmentation.

In audio, most of the work relies on the forward-backward segmentation algorithm. A robust version (resistant to adverse environments and independent of language and speaker) makes it possible to locate the relevant information, extract it and use it in various domains:

In automatic language identification: starting from the identification of vocalic segments, a new prosodic unit called the pseudo-syllable has been defined to characterize rhythm and intonation. Prosody can thus be modeled and introduced into an automatic language identification system to complement the acoustic and phonetic modeling.
In automatic speaker verification: the automatic segmentation provides the transient zones, which carry speaker-specific information.
In speech/music detection: the segmentation process behaves quite differently on speech and on music. Modeling the distribution of the segments makes speech/music discrimination more robust.

In video, most of the analyses start from a preliminary segmentation into shots, obtained by detecting hard cuts and localizing dissolves. Some extensions to this tool also make it possible to analyze compositing effects (overlay detection, split-screen localization, and so on). In some cases, a spatiotemporal representation of the content, called an “X-ray” image, is computed to obtain a micro-segmentation into segments of homogeneous camera work.
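As an illustration of hard cut detection, here is a minimal sketch based on gray-level histogram differences between consecutive frames. This is a common textbook technique, not necessarily the one used by the team; the function name `detect_hard_cuts`, the number of bins and the threshold are all assumptions chosen for the example.

```python
import numpy as np

def detect_hard_cuts(frames, bins=16, threshold=0.5):
    """Return the indices i for which a cut is assumed between frame i-1 and frame i.

    frames: array of shape (n_frames, height, width), gray levels in [0, 255].
    bins and threshold are illustrative values, not tuned on real data.
    """
    frames = np.asarray(frames)
    hists = []
    for frame in frames:
        # Normalized gray-level histogram of the frame.
        h, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hists.append(h / h.sum())
    hists = np.array(hists)
    # Half the L1 distance between consecutive histograms lies in [0, 1].
    dist = 0.5 * np.abs(np.diff(hists, axis=0)).sum(axis=1)
    return [int(i) + 1 for i in np.flatnonzero(dist > threshold)]
```

A dissolve spreads the change over many frames, so a single-frame threshold of this kind would typically miss it; that case needs a separate detector.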

Description

The classification and data mining methods studied derive from both generative and discriminative approaches. In general, they rely on supervised learning. Although “Hidden Markov Models” (HMM) remain the main framework in which the team develops its own models and classifiers, some new approaches have been investigated:

To exploit the correlation between parameters, Dynamic Bayesian Networks have been studied. They lead to greater confidence and robustness than HMM.
A new model derived from SVM has been proposed to take numerous databases into account and to process sequences of observation vectors of variable length. A new kernel between pairs of sequences has been studied theoretically within an SVM scheme and implemented for automatic speaker verification.
A multi-level human model has been proposed to analyze human motion without prior knowledge about the video source. The model is decomposed into three hierarchical levels, each corresponding to a resolution level. Current developments concern the hierarchical decomposition and the handling of the matching process across the model levels, so as to respect spatial and temporal constraints and to take into account the invariant dynamic aspects of human motion.
Because the information sources are very often multiple (within one medium or across media), the integration method becomes a key strategic issue. Multimodal integration methods are usually classified as either decision fusion (late fusion) or early fusion. To go beyond the classical combination of weighted scores or the straightforward concatenation of observation vectors, several strategies have been studied within a formal methodology: they rely on confidence indices (for classes, experts and observations) and on probability and uncertainty theories. First experiments have been performed for automatic language identification, where acoustic, phonotactic and prosodic information is merged.
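The two classical baselines mentioned above, concatenation of observation vectors (early fusion) and weighted combination of scores (late fusion), can be sketched as follows. This only illustrates the baselines that the studied strategies aim to improve on, not the confidence-based methods themselves; the function names and the example weights are assumptions.

```python
import numpy as np

def early_fusion(feature_vectors):
    """Early fusion baseline: concatenate the per-modality observation vectors."""
    return np.concatenate([np.asarray(v, dtype=float) for v in feature_vectors])

def late_fusion(expert_scores, weights):
    """Late (decision) fusion baseline: weighted sum of per-expert class scores.

    expert_scores: shape (n_experts, n_classes), one row per expert.
    weights: shape (n_experts,), hypothetical per-expert weights.
    """
    scores = np.asarray(expert_scores, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so that the weights sum to 1
    return w @ scores

# Hypothetical scores from three experts (e.g. acoustic, phonotactic, prosodic)
# for two candidate classes.
fused = late_fusion([[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]], weights=[1, 1, 2])
```

In this baseline the weights are fixed per expert; a confidence index defined per class or per observation would make them vary from one decision to the next.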

Description

The main goal of our group on this topic is to define tools able to analyze the structure of the audio and video tracks at the same time. To do so, two main approaches are currently being explored:

The first one aims at highlighting the existence of a temporal relationship between two types of event. An event is to be understood here as a “segment in which a given type of content can be observed”, such as a given face, a given speaker, music, a graphical icon, etc. The temporal relation between two segments can be characterized by three numerical parameters. Each pair of segments produced by two segmentation processes can therefore be associated with one point in a 3D space. We then carry out a vote over all the temporal relations found between all the automatically identified segments. The distribution of the votes in the 3D matrix can then be used to identify different pieces of information, such as:
- The association between a voice and a face (corresponding to the same person),
- The players belonging to the same team in a TV game,
- The identity of the host of a radio program,
- etc.
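The voting scheme described above can be sketched as follows. The text does not name the three parameters, so this sketch assumes three offsets between segment boundaries (start to start, end to end, and end of the first segment to start of the second); the function names, the bin width and the maximum offset are likewise assumptions.

```python
import numpy as np

def temporal_relation(seg_a, seg_b):
    """Three assumed parameters describing where seg_b lies relative to seg_a.

    Each segment is a (start, end) pair in seconds.
    """
    (a_start, a_end), (b_start, b_end) = seg_a, seg_b
    return (b_start - a_start, b_end - a_end, b_start - a_end)

def vote(segments_a, segments_b, bin_width=1.0, max_abs=5.0):
    """Accumulate one vote per segment pair in a 3D matrix."""
    n_bins = int(np.ceil(2 * max_abs / bin_width))
    votes = np.zeros((n_bins, n_bins, n_bins), dtype=int)
    for seg_a in segments_a:
        for seg_b in segments_b:
            params = np.array(temporal_relation(seg_a, seg_b), dtype=float)
            if np.any(np.abs(params) >= max_abs):
                continue  # pair too far apart to be treated as related
            idx = np.floor((params + max_abs) / bin_width).astype(int)
            votes[tuple(idx)] += 1
    return votes

# Hypothetical segments: a face on screen and a voice starting 0.5 s later.
faces = [(0.0, 4.0), (10.0, 14.0), (20.0, 24.0)]
voices = [(0.5, 4.5), (10.5, 14.5), (20.5, 24.5)]
votes = vote(faces, voices)
peak = np.unravel_index(votes.argmax(), votes.shape)
```

A cell that collects many votes indicates a temporal relation that recurs between the two event types, which is the kind of regularity that could suggest, for example, that a voice and a face belong to the same person.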

The second one aims at establishing a style similarity measure between two recordings. The input of this method can be any low-level feature that associates a series of numerical values with an audiovisual content along the temporal dimension. Local best matches between the values extracted from two different contents are then computed with an optimized algorithm. The quality scores of all the matches are gathered in a 2D matrix. Here again, the distribution of the highest coefficients in this matrix makes it possible to characterize two types of similarity:
- Diagonal distributions correspond to a strict similarity. This can be observed when some subparts of the compared documents are exactly the same.
- Distributions in “blocks” correspond locally to a style similarity. The content in the corresponding segments is not exactly the same, but presents some common properties.

From this “similarity matrix”, a global similarity measure can then be extracted and used, for example, in a document classification task.
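A minimal sketch of such a similarity matrix is given below. It is a brute-force illustration, not the optimized matching algorithm mentioned above: the fixed-length windows, the similarity score (inverse of one plus the mean absolute difference) and the global measure (mean of the best match of each window) are all assumptions made for the example.

```python
import numpy as np

def similarity_matrix(series_a, series_b, window=4):
    """Similarity between every window of series_a and every window of series_b."""
    a = np.asarray(series_a, dtype=float)
    b = np.asarray(series_b, dtype=float)
    # Split each series into consecutive non-overlapping windows,
    # dropping any incomplete window at the end.
    wa = a[: len(a) // window * window].reshape(-1, window)
    wb = b[: len(b) // window * window].reshape(-1, window)
    # Mean absolute difference between every pair of windows.
    dist = np.abs(wa[:, None, :] - wb[None, :, :]).mean(axis=2)
    return 1.0 / (1.0 + dist)  # 1.0 for identical windows, near 0 for distant ones

def global_similarity(matrix):
    """Assumed global measure: average, over windows of A, of the best match in B."""
    return float(matrix.max(axis=1).mean())

# Comparing a series with itself puts the highest coefficients on the diagonal.
feature = np.arange(12.0)
sim = similarity_matrix(feature, feature)
```

With real features, a run of high coefficients along a diagonal would point to identical subparts, while a rectangular block of moderately high coefficients would point to segments that share a style without being identical.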
