Dedicated Features for Music Genre Classification

Csenac / January 21, 2020 / Analysis


In the context of music genre classification, we propose to use, as inputs to a CNN, a set of eight music features chosen along three main musical dimensions: dynamics, timbre, and tonality.

With CNNs (Figure 1) trained so that the filter dimensions are interpretable in time and frequency, results show that only eight music features are as efficient as the 513 frequency bins of a spectrogram, and that a late score fusion between systems based on the two feature types further improves accuracy.


Baseline System: spectrogram obtained by an FFT over a 46.44 ms Hamming analysis window. The output for each frame is a 513-dimensional vector.

Music Features System:

  • Dynamics Feature 
    • Short-term energy: Metal and Classical are strongly characterized by their energy.
  • Timbre Features
    • Zero-crossing rate (rate of sign changes of the signal): for detecting percussive or noisy tracks.
    • Brightness (proportion of energy above a cut-off frequency of 1.5 kHz): for detecting high-frequency content.
    • Spectral Flatness (ratio of the geometric to the arithmetic mean of the power spectrum): for distinguishing smooth (noise-like) from spiky (tonal) spectra.
    • Spectral Shannon Entropy (Shannon entropy of the spectrum, measuring the amount of information it contains): for detecting the presence of predominant peaks.
    • Spectral Roughness (average of the dissonance over all possible pairs of spectral peaks): roughness appears when two frequencies are very close but not identical, which induces sensory dissonance.
  • Tonality Features
    • Key Clarity (the key strength associated with the best key(s), i.e. the peak ordinate(s) in the chromagram; the key strength is a score computed from the chromagram, which shows the distribution of energy along pitches or pitch classes): useful for knowing whether a song is tonal or atonal. Hip Hop generally has a low key clarity, whereas Country and Blues tend to have high values.
    • Harmonic Change Detection Function (flux of the tonal centroid, computed from the chromagram): represents the chords (groups of notes) being played.

Late Fusion:

  • Single System: each network input corresponds to a 3-second clip, and the network returns a genre decision for each clip. The overall genre of a piece of music is then obtained by a majority vote over the network outputs of the clips that compose it.
  • Fusion of the two Systems: the probabilities of the two networks are averaged for each clip and each genre. The decision then follows the same scheme as for a single system.


Figure 1 - The network topology: n corresponds either to the 513 frequency bins or to the 8 music features; features are aggregated over 3 seconds with 50% overlap.


  • Experiments were carried out on the GTZAN dataset
    • 10 genres: 10 × 100 recordings of 30 s each
    • Sources: radio, compact discs, and MP3
    • Format: 22,050 Hz, 16-bit, mono
  • Implementation
    • Python with Theano
    • GPU on NVIDIA Tesla K40
  • Conclusion
    • 8 musical features chosen along dynamics, timbre and tonality dimensions
    • CNNs: filters dimensions interpretable in time and frequency
    • SPECTRO (88%) < 8 MUSIC features (90%) < FUSION (91%)


Main publications

Christine Senac, Thomas Pellegrini, Florian Mouret, Julien Pinquier. Music Feature Maps with Convolutional Neural Networks for Music Genre Classification (short paper). In: International Workshop on Content-Based Multimedia Indexing (CBMI 2017), Florence, Italy, 19/06/2017-21/06/2017, ACM: Association for Computing Machinery, p. 1-5, June 2017.

Christine Senac, Thomas Pellegrini, Julien Pinquier, Florian Mouret. Réseaux de neurones convolutifs et paramètres musicaux pour la classification en genres (regular paper). In: Colloque GRETSI sur le Traitement du Signal et des Images (GRETSI 2017), Juan-les-Pins, 05/09/2017-08/09/2017, GRETSI CNRS, p. 1-5, September 2017.
