Feature Extraction

Jpetiot/ December 19, 2004/ Analysis


The objective of the feature extraction step is to capture the most relevant and discriminative characteristics of the signal to be recognized. Although signals of the same class vary, for instance because of different speaking styles across users, some consistencies must exist. Feature extraction captures these consistencies, and the extracted features are fed into a classifier.

The features used in our speech / music classification system are:


Cepstral coefficients

Cepstral coefficients are the coefficients of the Fourier transform representation of the log magnitude spectrum (see [Oppenheim and Schafer, 1974] for more details).
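A minimal sketch of this definition, using the real cepstrum (the frame length and the number of retained coefficients are illustrative choices, not values taken from the original system):

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=13):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small floor avoids log(0)
    cepstrum = np.fft.irfft(log_mag)
    return cepstrum[:n_coeffs]  # keep the first (low-quefrency) coefficients

# Usage: one 16 ms frame at 16 kHz (256 samples) of a 440 Hz tone
frame = np.sin(2 * np.pi * 440 * np.arange(256) / 16000)
coeffs = real_cepstrum(frame)
```

The low-quefrency coefficients describe the spectral envelope, which is what makes them useful for discriminating signal classes.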

Spectral coefficients

One of the most common techniques for studying a speech signal is the power spectrum. The power spectrum of a speech signal describes the frequency content of the signal over time (see [Scheirer and Slaney, 1997] for more details).
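A short-time power spectrum of one frame can be sketched as follows (sampling rate, window choice and frame length are illustrative assumptions):

```python
import numpy as np

def power_spectrum(frame, fs=16000):
    """Power spectrum of one windowed frame; returns (frequencies, power)."""
    windowed = frame * np.hamming(len(frame))      # taper to reduce leakage
    spectrum = np.fft.rfft(windowed)
    power = np.abs(spectrum) ** 2 / len(frame)     # per-bin power
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return freqs, power

# Usage: the peak of the spectrum of a 440 Hz tone lands near 440 Hz
frame = np.sin(2 * np.pi * 440 * np.arange(256) / 16000)
freqs, power = power_spectrum(frame)
```

Applying this to successive frames yields the spectrogram-like time-frequency description mentioned above.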

4 Hz modulation energy

The speech signal has a characteristic energy-modulation peak around the 4 Hz syllabic rate. To model this property, the classical procedure is applied: the signal is segmented into 16 ms frames, Mel frequency spectrum coefficients are extracted, and the energy is computed in 40 perceptual channels. This energy is then filtered with an FIR band-pass filter centered on 4 Hz. The energy is summed over all channels and normalized by the mean energy of the frame. The modulation is obtained by computing the variance of the filtered energy (in dB) over one second of signal. Speech (Figure 1a) carries more modulation energy than music (Figure 1b).

Figure 1a: 4 Hz energy modulation for speech. Figure 1b: 4 Hz energy modulation for music.
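The procedure above can be sketched as follows. Two simplifications are assumed here: the broadband frame energy stands in for the 40 Mel channels, and the exact pass band of the FIR filter (2-6 Hz around the 4 Hz center) is a design choice not stated in the original description:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def modulation_energy_4hz(signal, fs=16000, frame_ms=16):
    """Variance of the 4 Hz band-passed frame energy (dB) over one second."""
    frame_len = int(fs * frame_ms / 1000)              # 256 samples at 16 kHz
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy_db = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    env_fs = 1000.0 / frame_ms                         # 62.5 Hz envelope rate
    # FIR band-pass centered on 4 Hz (assumed 2-6 Hz pass band)
    bandpass = firwin(63, [2.0, 6.0], pass_zero=False, fs=env_fs)
    filtered = lfilter(bandpass, 1.0, energy_db)
    steady = filtered[63:]                             # drop the filter transient
    frames_per_sec = int(env_fs)                       # ~62 frames per second
    return float(np.var(steady[:frames_per_sec]))
```

A signal whose energy is modulated at 4 Hz (speech-like) scores higher than a steady tone (music-like), which is exactly the contrast the feature exploits.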

Entropy modulation

Music appears to be more ordered than speech, judging from observations of both the signals and their spectrograms. To measure this disorder, we evaluate a feature based on the signal entropy:

H = - Σ_i p_i · log2(p_i)

where p_i denotes the probability of the i-th observed amplitude value within a frame.

The signal is segmented into 16 ms frames and the entropy is computed on every frame. This measure is then used to compute the entropy modulation over one second of signal. For the signals of the previous paragraph, entropy modulation is higher for speech (Figure 2a) than for music (Figure 2b).

Figure 2a: entropy modulation for speech. Figure 2b: entropy modulation for music.
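The entropy modulation computation can be sketched as follows. The histogram-based Shannon entropy estimator and the number of bins are assumptions; the text does not specify the exact estimator used:

```python
import numpy as np

def frame_entropy(frame, n_bins=32):
    """Shannon entropy (bits) of the amplitude histogram of one frame."""
    hist, _ = np.histogram(frame, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]                                   # 0 * log(0) taken as 0
    return float(-np.sum(p * np.log2(p)))

def entropy_modulation(signal, fs=16000, frame_ms=16):
    """Variance of the per-frame entropy over one second of signal."""
    frame_len = int(fs * frame_ms / 1000)
    n = len(signal) // frame_len
    entropies = np.array([frame_entropy(signal[i * frame_len:(i + 1) * frame_len])
                          for i in range(n)])
    frames_per_sec = int(1000 / frame_ms)          # ~62 frames in one second
    return float(np.var(entropies[:frames_per_sec]))
```

A signal alternating between noise bursts and silence (disordered, speech-like) shows a much larger entropy modulation than a steady tone.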

Number of segments

The duration feature is a consequence of applying the segmentation algorithm. The speech signal is composed of alternating transient and steady parts (the steady parts are mainly vowels). Music, on the other hand, is more constant, so the number of changes (segments) is greater for speech (Figure 3a) than for music (Figure 3b). To estimate this feature, we compute the number of segments over one second of signal.

Figure 3a: segmentation and spectrogram on about 1 second of speech. Figure 3b: segmentation and spectrogram on about 1 second of music.
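The original system relies on a dedicated segmentation algorithm that is not detailed in this post. As a rough stand-in to illustrate the idea, the sketch below declares a segment boundary whenever the frame-to-frame log-magnitude spectral distance exceeds a fixed threshold (both the distance and the threshold are assumptions):

```python
import numpy as np

def count_segments(signal, fs=16000, frame_ms=16, threshold=1.0):
    """Rough segment count over a signal via spectral change detection."""
    frame_len = int(fs * frame_ms / 1000)
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    # Log-magnitude spectrum of each windowed frame
    spectra = np.log(np.abs(np.fft.rfft(frames * np.hamming(frame_len),
                                        axis=1)) + 1e-6)
    # RMS spectral distance between consecutive frames
    dist = np.sqrt(np.mean(np.diff(spectra, axis=0) ** 2, axis=1))
    return int(np.sum(dist > threshold)) + 1   # boundaries + 1
```

A signal that keeps switching frequency content yields many segments, while a steady tone yields a single one, mirroring the speech/music contrast described above.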

Segment duration

The segments (see the segmentation algorithm) are generally longer for music (Figure 3b) than for speech (Figure 3a). We have decided to model the segment duration with an inverse Gaussian distribution (Wald law).

The probability density function (pdf) is given by:

f(x; μ, λ) = sqrt( λ / (2π x³) ) · exp( −λ (x − μ)² / (2 μ² x) ),  for x > 0,

where μ > 0 is the mean and λ > 0 is the shape parameter.
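A quick numerical check of this density (the parameter values are illustrative only):

```python
import numpy as np

def wald_pdf(x, mu, lam):
    """Inverse Gaussian (Wald) density with mean mu and shape lam."""
    x = np.asarray(x, dtype=float)
    return (np.sqrt(lam / (2 * np.pi * x ** 3))
            * np.exp(-lam * (x - mu) ** 2 / (2 * mu ** 2 * x)))

# The density integrates to 1 and has mean mu, as expected
x = np.linspace(1e-4, 60.0, 600000)
dx = x[1] - x[0]
f = wald_pdf(x, 1.0, 3.0)
```

For reference, the same density is available in SciPy as `scipy.stats.invgauss(mu / lam, scale=lam)`, which uses a normalized parameterization.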


Julien Pinquier (contact)

Régine André-Obrecht (contact)

Jean-Luc Rouas



Main Publications

Julien Pinquier, Jean-Luc Rouas and Régine André-Obrecht, A Fusion Study in Speech/Music Classification. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP'2003), Hong Kong, China, 06/04/03-10/04/03, Vol. II, IEEE, p. 17-20, April 2003.

Julien Pinquier, Indexation sonore : recherche de composantes primaires pour une structuration audiovisuelle (Audio indexing: searching for primary components for audiovisual structuring). PhD thesis, Université Paul Sabatier, December 2004.


Oppenheim A. and Schafer R., Digital Signal Processing, Upper Saddle River, NJ: Prentice-Hall, 1974.

Scheirer E. and Slaney M., Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator, ICASSP’97, Munich, 1997.
