High Level Feature Extraction

Jpetiot / January 7, 2009 / Applications


Most existing video search engines rely on context and textual metadata such as the title of the video, tags and comments written by users. In other words, they make no attempt to understand the actual content of the video. Content-based audiovisual analysis aims at bridging this so-called semantic gap.


Our approach to this problem aims at being as generic, modular and automatic as possible:

  • Automatic: taking advantage of (and hopefully designing new) data-mining techniques.
  • Modular: when a new content descriptor becomes available, no system redesign should be necessary; the system should simply acknowledge its availability and make (smart) use of it.
  • Generic: the ideal system would adapt to different types of content (TV shows, news broadcasts, movies or user-generated personal movies).

High-level feature extraction is one of those tasks where such an approach would be very helpful. As defined in the TRECVid evaluation campaign organized by NIST, the task is the following: given a large set of videos and their associated shot boundaries, automatically output, for each concept in a pre-defined list of semantic concepts (as varied as chair, cityscape and person singing…), the list of shots that contain it.
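The expected output can be pictured as one shot list per concept. The sketch below assumes the list is ranked by detector confidence; the function name `rank_shots`, the shot identifiers and the score values are all made up for illustration and do not come from TRECVid data.

```python
def rank_shots(scores, top_k=None):
    """Return, for each concept, shot ids sorted by decreasing score.

    `scores` maps concept -> {shot_id: score}. This layout is hypothetical.
    """
    ranked = {}
    for concept, per_shot in scores.items():
        # Sort by descending score; break ties by shot id for determinism.
        order = sorted(per_shot, key=lambda shot: (-per_shot[shot], shot))
        ranked[concept] = order if top_k is None else order[:top_k]
    return ranked


# Made-up detector scores for three shots and two concepts.
example_scores = {
    "chair": {"shot1_1": 0.10, "shot1_2": 0.80, "shot2_1": 0.45},
    "cityscape": {"shot1_1": 0.70, "shot1_2": 0.05, "shot2_1": 0.70},
}
ranked = rank_shots(example_scores, top_k=2)
```

Each concept is handled independently, so adding a concept only adds one more entry to the dictionary.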

Support vector machine using unbalanced data

A collaborative annotation effort made it possible to annotate the videos of the whole development set with the 20 semantic concepts, thus opening the door to the use of automatic machine-learning techniques.

However, one main difficulty in this type of task is the unbalanced nature of the available training dataset: there are many more negative samples (not containing the concept) than positive ones. Support vector machines, for instance, may suffer from this imbalance, with the decision boundary ending up too close to the positive samples.

The figures below show one attempt to automatically select a good sub-set of the original training set to achieve better classification performance.

Iterative removal of support vectors

The idea here is to iteratively remove from the training set the support vectors (black) that belong to the dominant class, so that the boundary between positive (red) and negative (green) samples does not end up too close to the positive samples.
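The loop described above might be sketched as follows. This is only an illustration under several assumptions: a tiny linear SVM trained by batch subgradient descent stands in for a real SVM library, a support vector is taken to be any sample lying on or inside the margin, and the fixed number of rounds is arbitrary since the text gives no stopping criterion. The names `train_linear_svm` and `prune_dominant_class` and the synthetic data are hypothetical.

```python
import random


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def train_linear_svm(X, y, lam=0.01, epochs=300, lr=0.05):
    """Tiny linear SVM (hinge loss + L2 penalty), batch subgradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1.0:  # sample violates the margin
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def prune_dominant_class(X, y, dominant=-1, rounds=3):
    """Return indices kept after iteratively dropping dominant-class SVs."""
    keep = list(range(len(X)))
    for _ in range(rounds):
        w, b = train_linear_svm([X[i] for i in keep], [y[i] for i in keep])
        # Drop dominant-class samples on or inside the margin (assumed SVs).
        remaining = [
            i for i in keep
            if not (y[i] == dominant and y[i] * (dot(w, X[i]) + b) <= 1.0)
        ]
        has_dominant = any(y[i] == dominant for i in remaining)
        if len(remaining) == len(keep) or not has_dominant:
            break  # nothing left to remove, or the class would vanish
        keep = remaining
    return keep


# Synthetic unbalanced data: 10 positives, 100 negatives (made up).
rng = random.Random(0)
X = [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(10)]
X += [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
y = [1] * 10 + [-1] * 100
kept = prune_dominant_class(X, y)
```

Positive samples are never removed, so each round can only push the boundary away from them. A classifier retrained on `kept` would then be compared against the original one on held-out data.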

Fusion of multiple descriptors

For the TRECVid 2009 High Level Feature Extraction task, several audio and visual descriptors were extracted, processed and then fused in order to improve performance over mono-modal systems.

A (not-so-exhaustive) list of descriptors includes:

  • Visual descriptors: SIFT and color local descriptors with “bag-of-words” approach, face detection
  • Speech descriptors: MFCC (mel-frequency cepstral coefficients), voicing percentage, 4Hz modulation, …
  • Music descriptors: YIN, vibrato, …
  • Other audio descriptors: zero-crossing rate, audio energy, spectral statistics
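Two of the simplest descriptors in this list, zero-crossing rate and audio energy, can be computed frame by frame as in the sketch below. The frame size, hop size and synthetic test signal are assumptions chosen for illustration, not the settings used in the actual system.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)


def frames(signal, size, hop):
    """Split a signal into overlapping frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


# Synthetic example: one second of a 100 Hz sine sampled at 8 kHz.
sample_rate = 8000
signal = [
    math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(sample_rate)
]
features = [
    (zero_crossing_rate(f), short_time_energy(f))
    for f in frames(signal, size=400, hop=200)
]
```

Each frame yields one small feature vector; a sequence of such vectors would then be summarized per shot before being passed to a classifier.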

As shown in the figures below, each descriptor is used to train one SVM, which is then used to output the probability that a sample belongs to the positive class (i.e. contains the semantic concept).

Training one SVM classifier per type of descriptor

These probabilities are then concatenated into one global vector on which a final SVM, denoted SVM*, is trained.

Training an SVM using all "per-descriptor" SVM scores
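The two training stages might look like the sketch below. It rests on several assumptions: a tiny linear SVM trained by subgradient descent replaces a real SVM library, a fixed sigmoid stands in for a fitted calibration such as Platt scaling, and the descriptor names and synthetic features are made up. In practice the first-stage scores used to train the second stage would often be produced on held-out data to limit overfitting.

```python
import math
import random


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Tiny linear SVM (hinge loss + L2 penalty), batch subgradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1.0:  # sample violates the margin
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def probability(model, x):
    """Squash the decision value into (0, 1) with a fixed sigmoid (assumed)."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(dot(w, x) + b)))


def train_fusion(features, y):
    """`features` maps descriptor name -> list of per-shot vectors."""
    names = sorted(features)
    # Stage 1: one SVM per descriptor.
    stage1 = {name: train_linear_svm(features[name], y) for name in names}
    # Concatenate the per-descriptor probabilities, one vector per shot.
    fused = [
        [probability(stage1[name], features[name][i]) for name in names]
        for i in range(len(y))
    ]
    # Stage 2: the final classifier (SVM*) is trained on the fused vectors.
    final = train_linear_svm(fused, y)
    return names, stage1, final, fused


# Synthetic shots: 20 positives, 80 negatives, two made-up descriptors.
rng = random.Random(0)
y = [1] * 20 + [-1] * 80
features = {
    "visual": [[rng.gauss(lab, 1.0), rng.gauss(lab, 1.0)] for lab in y],
    "audio": [[rng.gauss(0.5 * lab, 1.0)] for lab in y],
}
names, stage1, final, fused = train_fusion(features, y)
```

Because the second stage only sees one score per descriptor, adding a descriptor means training one more first-stage model and widening the fused vector by one entry.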

In practice, when a new video is available for semantic concept detection, all descriptors are extracted, the per-descriptor SVMs are applied, their scores are concatenated and the final score is given by the SVM* classifier.

Two-step testing
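The test-time flow can be sketched as plain function composition. Everything here is a placeholder: `detect_concept` is a hypothetical name, and the extractors and models are toy callables standing in for the real descriptor extractors, the per-descriptor SVMs and SVM*.

```python
def detect_concept(video, extractors, descriptor_models, final_model):
    """Two-step scoring of one video (or shot) for one concept."""
    # Use a fixed descriptor order so it matches the order used in training.
    names = sorted(extractors)
    # Step 1: extract every descriptor and score it with its own model.
    scores = [
        descriptor_models[name](extractors[name](video)) for name in names
    ]
    # Step 2: the concatenated scores are the input of the final model.
    return final_model(scores)


# Toy stand-ins (not real extractors or trained classifiers).
extractors = {
    "audio_energy": lambda v: v["energy"],
    "face_count": lambda v: v["faces"],
}
descriptor_models = {
    "audio_energy": lambda energy: min(1.0, energy),
    "face_count": lambda count: 1.0 if count > 0 else 0.0,
}
final_model = lambda scores: sum(scores) / len(scores)

score = detect_concept(
    {"energy": 0.4, "faces": 2}, extractors, descriptor_models, final_model
)
```

Keeping the extractors and models in dictionaries keyed by descriptor name reflects the modular goal stated earlier: supporting a new descriptor amounts to adding one entry to each dictionary and retraining the final model.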

Main Publications

