Dedicated Features for Music Genre Classification
Context
In the context of music genre classification, we propose to use, as inputs to a CNN, a set of eight music features chosen along three main musical dimensions: dynamics, timbre and tonality.
With CNNs (Figure 1) trained so that the filter dimensions are interpretable in time and frequency, results show that these eight music features alone are as efficient as the 513 frequency bins of a spectrogram, and that late score fusion between systems based on the two feature types further improves accuracy.
Features
Baseline System – Spectrogram: FFT computed on a 46.44 ms Hamming analysis window; each frame yields a 513-dimensional vector.
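As a sanity check on these numbers: at the 22,050 Hz sampling rate of the corpus, a 1024-sample Hamming window spans 1024 / 22050 ≈ 46.44 ms, and a 1024-point real FFT keeps 1024 / 2 + 1 = 513 non-redundant bins. A minimal numpy sketch of this frame analysis is given below; the hop size is an assumption, not taken from the original description.

```python
import numpy as np

def spectrogram_frames(signal, sr=22050, n_fft=1024, hop=512):
    """Magnitude spectrogram: one 513-dimensional vector per frame.

    1024 samples at 22,050 Hz span ~46.44 ms; an n_fft-point real FFT
    keeps n_fft // 2 + 1 = 513 non-redundant bins.
    The hop size (512 samples) is illustrative, not from the source.
    """
    window = np.hamming(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))   # shape (513,)
    return np.stack(frames)                          # shape (n_frames, 513)
```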
Music Features System: eight descriptors grouped along the three musical dimensions below (a sketch of some of the frame-level computations follows the list).
- Dynamics Feature
- Short-term energy: genres such as Metal and Classical are strongly characterized by their energy levels.
- Timbre Features
- Zero-crossing rate (rate of sign changes of the signal): detects percussive or noisy tracks.
- Brightness (amount of energy above a cut-off frequency of 1.5 kHz): detects high-frequency content.
- Spectral Flatness (a statistical descriptor of the power spectrum): indicates whether the spectrum is smooth or spiky.
- Spectral Shannon Entropy (amount of information contained in the spectrum): detects the presence of predominant peaks.
- Spectral Roughness (average of the dissonance between all possible pairs of spectral peaks): roughness appears when two frequencies are very close but not exactly equal, which induces sensory dissonance.
- Tonality Features
- Key Clarity (the key strength associated with the best key(s), i.e. the peak ordinate(s); the key strength is a score computed from the chromagram, which shows the distribution of energy along pitches or pitch classes): useful to know whether a song is tonal or atonal. Hip Hop generally has a low Key Clarity, whereas Country and Blues tend to have high values.
- Harmonic Change Detection Function (flux of the tonal centroid, which is computed from the chromagram): represents the chords (groups of notes) being played.
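As a rough illustration, here is a minimal numpy sketch of a few of the frame-level descriptors above (short-term energy, zero-crossing rate, brightness, spectral flatness, spectral Shannon entropy). The exact definitions and normalizations of the original system may differ, and the tonality features (Key Clarity, Harmonic Change Detection Function) are omitted since they require a chromagram.

```python
import numpy as np

def frame_features(frame, sr=22050, n_fft=1024, cutoff_hz=1500):
    """Illustrative re-implementation of some descriptors; the original
    system's exact definitions may differ."""
    eps = 1e-12
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    power = spectrum ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)

    energy = np.mean(frame ** 2)                           # short-term energy
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0   # zero-crossing rate
    brightness = power[freqs >= cutoff_hz].sum() / (power.sum() + eps)
    flatness = np.exp(np.mean(np.log(power + eps))) / (np.mean(power) + eps)
    p = power / (power.sum() + eps)                        # normalized spectrum
    entropy = -np.sum(p * np.log2(p + eps))                # spectral Shannon entropy
    return np.array([energy, zcr, brightness, flatness, entropy])
```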
Late Fusion:
- Single System: each network input corresponds to a 3-second clip, and the network returns a genre decision for each clip. The overall genre of a piece of music is then obtained by a majority vote over the decisions of the clips that compose it.
- Fusion of the two Systems: the per-clip, per-genre probabilities of the two networks are averaged; the decision then follows the same scheme as for a single system (see the sketch after this list).
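A minimal numpy sketch of both decision schemes, assuming the per-clip genre probabilities of each network are available as arrays of shape (n_clips, n_genres); the function and variable names are illustrative.

```python
import numpy as np

def majority_vote(clip_probs):
    """Single system: one genre decision per 3-second clip, then a
    majority vote over all clips of the piece of music."""
    clip_decisions = clip_probs.argmax(axis=1)              # per-clip genre index
    return np.bincount(clip_decisions,
                       minlength=clip_probs.shape[1]).argmax()

def late_fusion(spectro_probs, feature_probs):
    """Fusion: average the per-clip, per-genre probabilities of the two
    networks, then apply the same majority-vote scheme."""
    return majority_vote((spectro_probs + feature_probs) / 2.0)
```

Averaging probabilities rather than hard clip decisions lets each system keep its confidence information before the final vote.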
Networks
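The exact architectures are not detailed here; the Keras sketch below only illustrates the idea of convolutional filters whose two dimensions map directly onto the feature (or frequency) axis and the time axis of a 3-second clip, so that each filter reads as a short temporal pattern. The number of filters, kernel sizes, frame count and dense-layer width are illustrative assumptions, not the authors' configuration (their implementation used Theano).

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_clip_cnn(n_features=8, n_frames=128, n_genres=10):
    """Illustrative CNN for one clip given as a (feature x time) map.

    The kernel covers all n_features rows and 4 frames, so each filter
    is a short temporal pattern over the feature axis.
    All sizes here are assumptions, not the published configuration.
    """
    model = keras.Sequential([
        layers.Input(shape=(n_features, n_frames, 1)),
        layers.Conv2D(32, kernel_size=(n_features, 4), activation="relu"),
        layers.MaxPooling2D(pool_size=(1, 4)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_genres, activation="softmax"),  # per-clip genre probabilities
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```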

Application
- Experiments were conducted on the GTZAN dataset
- 10 genres × 100 excerpts per genre, 30 s each
- Sources: radio, compact discs, and MP3 files
- Format: 22,050 Hz, 16-bit, mono
- Implementation
- Python with Theano
- NVIDIA Tesla K40 GPU
Conclusion
- Eight music features chosen along the dynamics, timbre and tonality dimensions
- CNNs with filter dimensions interpretable in time and frequency
- Accuracy: SPECTRO (88%) < 8 MUSIC features (90%) < FUSION (91%)
Contributors
- Christine Sénac (contact)
- Julien Pinquier
- Florian Mouret
- Thomas Pellegrini
Main publications
Christine Senac, Thomas Pellegrini, Florian Mouret, Julien Pinquier. Music Feature Maps with Convolutional Neural Networks for Music Genre Classification (short paper). In: International Workshop on Content-Based Multimedia Indexing (CBMI 2017), Florence, Italy, 19–21 June 2017, ACM: Association for Computing Machinery, pp. 1–5, June 2017. https://doi.org/10.1145/3095713.3095733
Christine Senac, Thomas Pellegrini, Julien Pinquier, Florian Mouret. Réseaux de neurones convolutifs et paramètres musicaux pour la classification en genres (regular paper). In: Colloque GRETSI sur le Traitement du Signal et des Images (GRETSI 2017), Juan-les-Pins, France, 5–8 September 2017, GRETSI CNRS, pp. 1–5, September 2017.