Our contribution is the introduction of costume as a feature for automatic content-based video indexing. To show the value of costume for indexing, we propose an automatic application that uses costumes to recognize the persons present in a video and to detect each occurrence of each character. Experiments were carried out on video sequences from various TV programs. The experimental results show good performance, which a more accurate costume extractor could improve in further experiments.
This presentation is organized as follows. Our application is presented in section 2. In section 3, we briefly present the face detection method we used. Section 4 presents the extraction of costumes, which will be processed in section 5. Finally, section 6 presents experimental results.
The application described here is semi-automatic: the first time a character appears, the user must give his or her name, and the detection of later appearances is then automatic. It could be made fully automatic by storing each new costume without asking the user and by reporting the different appearances of each (anonymous) character.
Our algorithm is structured in three parts and is applied to every frame of a video sequence. We use a database of labelled costumes, which can initially be empty. First, face detection is run to find the characters who may be present in the current frame, together with their approximate position and scale. Then, the costume of each character is extracted from the image according to the location and scale of his or her face. Finally, we compare the features extracted from the costume with those in the database. If a costume matches, the character is recognized. Otherwise, the user is asked to give the name of the character, and the new costume is added to the database. This algorithm is summed up in Fig.1.
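The three steps above can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the callbacks `detect_faces`, `extract_costume`, `similarity` and `ask_user`, the layout of `database` (a name mapped to a list of costume models) and the default `threshold` are all assumptions made for the example.

```python
def process_frame(frame, database, detect_faces, extract_costume,
                  similarity, ask_user, threshold=0.9):
    """Run the three steps on one frame and return the recognized names.

    database maps a character name to the list of costume models stored
    for that character. All callbacks and the threshold are hypothetical.
    """
    names = []
    for face in detect_faces(frame):                  # step 1: face detection
        costume = extract_costume(frame, face)        # step 2: costume features
        best_name, best_score = None, 0.0
        for name, models in database.items():         # step 3: database lookup
            for model in models:
                score = similarity(costume, model)
                if score > best_score:
                    best_name, best_score = name, score
        if best_name is None or best_score < threshold:
            # Unknown costume: ask the user and store it as a new model.
            best_name = ask_user(frame, face)
            database.setdefault(best_name, []).append(costume)
        names.append(best_name)
    return names
```

Because a new model is stored whenever no existing one matches, a character can end up with several models, which is consistent with the semi-automatic behaviour described above.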
We used the method presented in [10] and improved in [11], because a fast implementation is available in the Intel library OpenCV [12]. We do not explain this method in detail, because we only use it as a black box. Replacing it with another detector would obviously change the results, but the approach of our application would remain the same.
Our costume localization algorithm is based on face detection. However, frame-by-frame face localization introduces many false alarms, due to noise in the data. A single false face detection in a frame is enough to cause a false alarm in costume detection. An example is illustrated in Fig.2.
In order to reduce these false detections, we must exploit the properties of a video sequence by using a temporal approach. The use of temporal information was proposed in [13] for robust face tracking, with the CONDENSATION algorithm [14] for prediction over time. As our problem is not face tracking in a shot, we do not use the same approach: we propose a ``smoothing'' over time of face detection by using a temporal window.
For each frame, we detect all the faces using a static approach (at the moment we use the algorithm proposed in [11]). Then, we take a temporal window (subsequence) of $2N+1$ frames. For each candidate face, we count its number of occurrences in the $N$ previous frames and in the $N$ next frames. Recall that all these detections are made independently. Then, we keep a candidate face if it appears at least $N_{\min}$ times in this subsequence. In our application, we took $N = 2$ (which leads to a subsequence of 5 frames) and a fixed value of $N_{\min}$.
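The temporal smoothing described above can be sketched as follows. This is a minimal illustration under stated assumptions: faces are represented as hypothetical `(x, y, size)` tuples, `same_face` and its tolerance `tol` are invented for the example, and `min_count=3` is an assumed threshold, since the value used in the application is not given here. The costume comparison used to reject dissolves is left out.

```python
def same_face(a, b, tol=0.5):
    """Rough match of two (x, y, size) detections; tol is a hypothetical tolerance."""
    limit = tol * max(a[2], b[2])
    return (abs(a[0] - b[0]) <= limit and abs(a[1] - b[1]) <= limit
            and abs(a[2] - b[2]) <= limit)

def smooth_detections(detections, n=2, min_count=3):
    """Keep a face only if it is matched in at least min_count frames of the window.

    detections[t] is the list of (x, y, size) faces found independently in frame t.
    n=2 gives a 5-frame window; min_count=3 is an assumed threshold.
    """
    kept = []
    for t, faces in enumerate(detections):
        # The window is truncated at the beginning and end of the sequence.
        lo, hi = max(0, t - n), min(len(detections), t + n + 1)
        window = detections[lo:hi]
        kept.append([
            face for face in faces
            if sum(any(same_face(face, other) for other in frame)
                   for frame in window) >= min_count
        ])
    return kept
```

The count includes the current frame, so an isolated detection scores 1 and is discarded, while a face seen throughout the window is kept.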
We consider that two detected faces correspond to the same face if they are at roughly the same location. The position parameters may vary slightly because of camera work or character motion, so a small variation of these parameters is tolerated to take these effects into account. Moreover, to avoid detecting faces in dissolves, as presented in Fig.3, we consider that two faces correspond to the same face only if the costumes detected from these faces are also identical (in terms of features). An example of results is given in Fig.4.
To deal with these cases, we also used the localization method proposed in [15], which is not based on face localization. This method detects all the objects in an image that correspond to a color model, without a priori information about their number. First, the pixels are classified: from an object model, a new image is created in which each pixel holds a measure of membership to the model. This image represents the distribution of the pixels that are most likely to belong to the searched object. The approach then considers this binary image as a cluster in $\mathbb{R}^2$: by using the values of the image as weights associated with each pixel location, the task of object localization reduces to the detection of local modes in the cluster (a mode is a local density maximum). This search is carried out with a statistical method, the mean shift procedure [16]. Each mode is then associated with an object that corresponds to the model. This method needs prior information about the scale of the object to search for. To do without this information, we run the algorithm several times with different scales and keep the scale that provides the best coefficient. Moreover, we did not maximize the density estimate, as is done in [15], but the Bhattacharyya coefficient (cf. section 5.1).
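As an illustration of mode seeking on a weighted image, the sketch below runs a basic mean shift with a flat circular window. It is a simplified stand-in, not the method of [15] or the variant used here: it climbs the weighted density rather than the Bhattacharyya coefficient, and the function name, the `radius` parameter (playing the role of the scale) and the stopping values are assumptions.

```python
def mean_shift_mode(weights, start, radius, max_iter=50, eps=1e-3):
    """Climb to a local density mode of a 2-D weight image.

    weights[y][x] is the membership measure of pixel (x, y); radius is the
    flat-window radius, i.e. the scale that must be known in advance.
    """
    cx, cy = float(start[0]), float(start[1])
    height, width = len(weights), len(weights[0])
    for _ in range(max_iter):
        total = sx = sy = 0.0
        for y in range(max(0, int(cy - radius)), min(height, int(cy + radius) + 1)):
            for x in range(max(0, int(cx - radius)), min(width, int(cx + radius) + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    w = weights[y][x]
                    total += w
                    sx += w * x
                    sy += w * y
        if total == 0.0:
            break                         # no support under the window
        nx, ny = sx / total, sy / total   # weighted centroid = mean shift update
        shift = ((nx - cx) ** 2 + (ny - cy) ** 2) ** 0.5
        cx, cy = nx, ny
        if shift < eps:
            break
    return cx, cy
```

Running this from several starting points and with several radii, then keeping the best-scoring result, corresponds to the multi-scale search described above.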
However, this blind approach to costume detection takes too much computational time, since it has to be applied to each costume model at many different scales. Although we reduced the computational time with a simple heuristic (only searching for costumes whose histogram intersection [17] with the image histogram is above a threshold), this method takes more than one second per frame, which is too slow for real-time processing. In the sequel, we detect the new occurrences of each costume using only the face-based detection, because it significantly reduces the computational time of the costume search.
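The pruning heuristic mentioned above can be sketched as follows, using the usual definition of histogram intersection (sum of bin-wise minima, normalized by the mass of the model histogram). The function names, the dictionary of models and the `threshold` value are assumptions made for the example.

```python
def histogram_intersection(model, image):
    """Sum of bin-wise minima, normalized by the mass of the model histogram."""
    return sum(min(m, i) for m, i in zip(model, image)) / sum(model)

def candidate_costumes(models, image_hist, threshold=0.5):
    """Return the names of the costume models worth searching for in the image.

    models maps a name to a histogram of pixel counts; threshold is hypothetical.
    """
    return [name for name, hist in models.items()
            if histogram_intersection(hist, image_hist) >= threshold]
```

With pixel counts as histogram values, the score is the fraction of the model's pixels whose colors are available in the image, so a costume whose colors are mostly absent from the frame is skipped before the costly search.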
In our experiments, we used the Bhattacharyya coefficient, which is closely related to the Bayes error [18, p. 38]. The general form (derived from the Bayes error) is
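$$\rho(p, q) = \int \sqrt{p(x)\, q(x)}\, dx \qquad (1)$$

which, for two discrete distributions $p = (p_1, \dots, p_m)$ and $q = (q_1, \dots, q_m)$ such as normalized color histograms with $m$ bins, becomes

$$\rho(p, q) = \sum_{u=1}^{m} \sqrt{p_u\, q_u} \qquad (2)$$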
We also tried other similarity measures computed from color histograms (histogram intersection [17] and correlation measures, among others), but the best results were obtained with the Bhattacharyya coefficient.
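For normalized histograms, the Bhattacharyya coefficient is the sum over bins of the square roots of the bin-wise products. A minimal sketch follows; the joint RGB quantization with `bins_per_channel=8` and both function names are assumptions for illustration, not values taken from the experiments.

```python
import math

def rgb_histogram(pixels, bins_per_channel=8):
    """Normalized joint RGB histogram of (r, g, b) pixels with values in 0..255.

    bins_per_channel=8 is an assumed quantization, not a value from the paper.
    """
    step = 256 // bins_per_channel
    hist = [0.0] * bins_per_channel ** 3
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[index] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalized histograms, in [0, 1]."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))
```

The coefficient equals 1 for identical histograms and 0 for histograms with no common non-empty bin, so a costume is matched when the coefficient is high enough.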
First, we tried the HSV (Hue-Saturation-Value) system. To save computational time, we did not use the standard conversion formula [23], but an approximate one.
The same remarks can be made about the perceptually uniform L*a*b* system [22, p. 167]. Using the three components, the results are approximately the same (apart from the computational time). However, when we use only two components in order to deal with illumination changes, the results become worse. Those experiments were made essentially to remove the effects of lighting variations during the recognition process. Going back to the RGB color space filters out less of the lighting variation, and so may increase the false detection rate. However, our method is for now devoted to the indexing of TV talk-shows, and in that kind of content the shooting conditions are very stable, with no variation of the global illumination. This is the main reason why, on that kind of content, the RGB color space provides better results than the other ones. Hence, in the sequel, we only use the RGB color system.
During the various runs of the application, we measured the computational time, as well as the number of human interventions for the semi-automatic approach.
First, we ran it on a seven-minute excerpt of a TV game show, which contains 10623 frames. The frames were processed at a rate of 13 fps on a 1.7 GHz PC, with a C implementation and without any special optimization. In this sequence, the application succeeded in detecting the five main characters (one presenter and four contestants). The user had to type the name of the detected character 16 times, 8 of which were for the audience, and to type ``I'' (to ignore a detection) once. Two main characters needed two models, because of partial occlusion and a change of scale (an example is given in Fig.8). Most of the user entries were made in the first two minutes, when each character appears for the first time. After that, the number of user requests decreases: only audience detections and a few recognition failures remain, which occur when the extracted costume is already known but differs too much from the one in the database, as shown in Fig.8. Hence, the application could be run semi-automatically for a few minutes and then automatically, ignoring any new unknown characters.
Afterwards, we tried our application on a TV detective film. The biggest problem is that many characters wear the same suit. The system can therefore detect the appearance of a person wearing this suit, but it cannot tell which character it is. This case will be discussed in section 7.1. Moreover, unlike in talk-shows, the characters may change their clothes from one shot to the next. In this case, a manual intervention is needed for each new costume worn. Finally, each costume needs more models than for TV talk-shows. The average number of models needed for a costume was less than 2 for the talk-shows, but it is approximately 3 or 4 for this movie, due to changes in lightness and contrast, indoor/outdoor sequences, etc.
After running the application, we updated the index by giving a name to each detected character, so that the result could be compared with the manual index. The results are given in Table 1 and Table 2. At the end of the processing, the database contains 42 costumes. This sequence was processed at a rate of 5.33 frames per second. The recognition rate for the characters who belong to the first class is good: recall that this rate takes into account all the first class characters in all the frames of the whole sequence. The failure of our application to detect second and third class characters was foreseeable, since the character detection is based on face detection, which usually fails with those classes. Note that the computational time is longer than in the test with the semi-automatic approach: this will be explained in section 6.3.
| Class | Number of characters | Number of recognized characters |
| 1 | 19692 | 16508 (83.83 %) |
| 2 | 34978 | 696 (1.99 %) |
| 3 | 56857 | 2724 (4.79 %) |

| Number of false alarms | 12 |
| Number of misclassified characters | 55 (0.32 %) |
| Number of non-detections in class 1 | 3184 (16.17 %) |
| Number of non-detections in class 2 | 34282 (98.01 %) |
| Number of non-detections in class 3 | 54133 (95.21 %) |
In order to avoid this kind of missed detection and this useless processing, we used a shot-based approach. At the beginning of a shot, we run the classical method, but each character who is detected is assumed to appear in all the frames of the shot. Moreover, when we detect a first class character, we stop the processing until the end of the shot. Although this approach gives more false alarms, the number of missed detections decreases, and the processing becomes faster than real time. The numerical results are detailed in Tables 3, 4 and 5.
| Class | Number of characters | Number of recognized characters |
| 1 | 19692 | 17823 (90.51 %) |
| 2 | 34978 | 2191 (6.26 %) |
| 3 | 56857 | 3727 (6.56 %) |

| Number of false alarms | 135 |
| Number of misclassified characters | 1488 (7.43 %) |
| Number of non-detections in class 1 | 1869 (9.49 %) |
| Number of non-detections in class 2 | 32787 (93.74 %) |
| Number of non-detections in class 3 | 53130 (93.44 %) |
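The shot-based strategy described above can be sketched as follows. This is an illustrative outline only: `recognize` stands for the frame-based method, `is_first_class` for the test on the character class, and both callbacks as well as the function name are assumptions. Shot boundaries are supposed to be known.

```python
def index_shot(frames, recognize, is_first_class):
    """Label every frame of one shot, stopping recognition early when possible.

    recognize(frame) returns the set of names found in that frame and
    is_first_class(name) tells whether a character is first class; both
    are hypothetical callbacks standing in for the frame-based method.
    """
    present = set()
    done = False
    for frame in frames:
        if not done:
            found = recognize(frame)
            present |= found
            # Once a first class character is found, skip the rest of the shot.
            done = any(is_first_class(name) for name in found)
    # Each detected character is assumed to appear in all the frames of the shot.
    return [set(present) for _ in frames]
```

Propagating a label to the whole shot recovers frames where the face detector fails, which is consistent with the lower number of non-detections, but it also propagates any wrong label, which is consistent with the higher numbers of false alarms and misclassifications reported in the tables.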
However, using costume as the only feature for indexing can produce bad results when many characters wear identical clothes. In this case, costume can be used in addition to another feature to achieve reliable character identification. Recall that, for the primitive task of costume detection and identification, detecting identical clothes does make sense, even when they are worn by different characters. Furthermore, this kind of clue can help to identify roles or characters associated with the same "corporation" in a document: for instance, in the movie ``Men in Black'', all the characters who wear the same black suit belong to the same organization, and in a detective film the policemen may wear a blue uniform.
To evaluate this tool, we will have to propose two protocols: one to evaluate its ability to identify characters and one for costume identification. Obviously, the goals are not the same, and the tool may have to be slightly tuned to the task to be achieved. Those protocols will also have to deal with other problems, such as the production of the ground truth: what should be done when several persons are on screen at the same time? How should the audience be considered? How should the recognition of characters among the audience be taken into account? These situations are not rare in TV talk-shows and are problems we will have to deal with.
First, we want to try it with both a more robust and a less robust face detector. For now, the detector we use detects only frontal views. We would like to combine it with a profile view detector [13]. This would avoid all the missed detections due to characters who are not looking at the camera. We will also use other approaches for costume extraction. Instead of taking only the color histogram of the area under the face, we would like to use a texture distribution, in order to keep spatial information. Moreover, we will add a weight to every pixel in the selected area [17,21], so as to reduce the influence of the background. Finally, we want to separate a costume into different parts (tie, jacket, hat, trousers, etc.) in order to study the interest of costume for character function, i.e. the information that costume brings to the document and to the role of the characters.