Automatic Character Labelling in Videos

Jpetiot / January 7, 2005 / Analysis


Some experiments on automatic video summarization showed that the costume feature is one of the most significant clues for identifying keyframes belonging to given excerpts. This property is mainly explained by the fact that costumes are attached to a character's function in the video document. The approach proposed here consists first in characterizing the region located below a face detected by another analysis tool (in our case, we used several face detectors, including the one available in OpenCV). Some temporal filtering is then performed to make the detection more robust. Costume descriptors are stored in a structured database together with contextual information and a label. Each time a new face is detected, this database is checked so that the same label can be attached to a person who has already been met. At the end of the processing, the user can update the index by giving a real name to each label.
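The labelling step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 8-bins-per-channel quantisation, the histogram-intersection match, and the `threshold` value are all assumptions made for the example.

```python
import numpy as np

def color_histogram(region, bins=8):
    """Quantise an RGB region (H x W x 3, uint8) into a normalised
    bins^3 colour histogram used as the costume descriptor."""
    pixels = region.reshape(-1, 3).astype(int) // (256 // bins)
    idx = (pixels[:, 0] * bins + pixels[:, 1]) * bins + pixels[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def label_costume(hist, database, threshold=0.7):
    """Look the descriptor up in the database; reuse the best-matching
    label if the histogram intersection exceeds `threshold` (hypothetical
    value), otherwise register the costume under a new label."""
    best_label, best_score = None, 0.0
    for label, stored in database.items():
        score = np.minimum(hist, stored).sum()  # histogram intersection
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return best_label
    new_label = "person_%d" % len(database)
    database[new_label] = hist
    return new_label
```

At the end of processing, the user would simply rename the `person_N` keys of the database to real names.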


First, in order to reduce false face detections, we exploit the properties of a video sequence with a temporal approach. For each frame, we detect all the faces using a static approach. Then we take a temporal window (subsequence) of 2N + 1 frames. For each candidate face, we count its number of occurrences in the N previous frames and in the N next frames. Recall that all these detections are made independently. We then keep a candidate face if it appears at least N2 times in this subsequence. In our application, we took N = 2 (which leads to a subsequence of 5 frames) and N2 = 4. We showed that optimal values for these two parameters can be derived from the intrinsic performance of the face detector.

Let us now consider that a costume is described by a color histogram. To quickly find its location in a frame, an image of weights is created from the frame, representing the distribution of the pixels most likely to belong to the object. This image of weights is called the backprojected image and is based on the ratio histogram r = min(hf/hc, 1), where hf is the costume (model) histogram and hc the histogram of the current frame. Since the ratio histogram emphasizes the predominant colors of the costume while diminishing the presence of clutter and background colors, the backprojected image represents a spatial measure of the costume presence. From this image of weights, the problem is to determine whether there is a "group" of likely pixels and, if so, to detect it. Considering this image as a sample distribution in R², the "group" of pixels can be regarded as the distribution's global mode. A statistical method, the mean shift procedure, is then used to detect it.
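The backprojection and mean shift steps can be sketched as below. This is an illustrative numpy version under assumed choices (8-bins-per-channel quantisation, a fixed square search window); in practice OpenCV provides equivalents (`cv2.calcBackProject`, `cv2.meanShift`).

```python
import numpy as np

def quantised_hist(img, bins=8):
    """Normalised bins^3 colour histogram of an RGB image (uint8)."""
    q = img.reshape(-1, 3).astype(int) // (256 // bins)
    idx = (q[:, 0] * bins + q[:, 1]) * bins + q[:, 2]
    h = np.bincount(idx, minlength=bins ** 3).astype(float)
    return h / h.sum()

def backproject(frame, h_model, h_frame, bins=8):
    """Weight image from the ratio histogram r = min(h_model/h_frame, 1):
    each pixel gets the ratio value of its colour bin."""
    ratio = np.minimum(h_model / np.maximum(h_frame, 1e-9), 1.0)
    q = frame // (256 // bins)
    idx = (q[..., 0].astype(int) * bins + q[..., 1]) * bins + q[..., 2]
    return ratio[idx]

def mean_shift(weights, start, win=15, iters=20):
    """Shift a square window to the weighted centroid of the weights it
    covers, until convergence: the window centre climbs to the mode."""
    r, c = start
    h, w = weights.shape
    for _ in range(iters):
        r0, r1 = max(0, r - win), min(h, r + win + 1)
        c0, c1 = max(0, c - win), min(w, c + win + 1)
        patch = weights[r0:r1, c0:c1]
        total = patch.sum()
        if total == 0:
            break
        ys, xs = np.mgrid[r0:r1, c0:c1]
        nr = int(round((ys * patch).sum() / total))
        nc = int(round((xs * patch).sum() / total))
        if (nr, nc) == (r, c):
            break
        r, c = nr, nc
    return r, c
```

Starting the window anywhere near the costume, the weighted centroid updates converge on the global mode of the backprojected image, which gives the costume location.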

Figure: mean shift applied after backprojection (costume model and frame).

Doing so, we can improve person detection in television programs by about 2% compared with methods based only on face detection. Using color histograms of costumes to identify people in videos may lead to highly variable rates depending on the content. Applied to talk shows, we observed an error rate lower than 2%.



Main publications
