deeprhythm
github
pypi
deeprhythm is a convolutional neural network designed for rapid, precise tempo prediction for modern music. it runs on anything that supports pytorch
(I’ve tested ubuntu, macos, windows, raspbian).
audio is batch-processed using a vectorized harmonic constant-q modulation (hcqm), drastically reducing computation time by avoiding the usual bottlenecks encountered in feature extraction.
why?
I needed a way to accurately estimate tempo that was small & fast enough to run on a raspberry pi.
I tried the estimators from librosa, essentia, and tempocnn, but all (open source) methods were unreliable and very slow.
so, I did some research and found this paper, describing a cnn that predicts tempo using an audio feature they call ‘harmonic constant-q modulation’.
in short, they perform a series of constant-q transforms over an 8s window, and rather than extracting the usual ‘pitch’ frequencies, they extract much lower ’tempo’ frequencies, e.g.
freq range (Hz) | default cqt | hcqm |
---|---|---|
max | 9397.27 | 4.76 (286bpm) |
min | 32.70 | 0.5 (30bpm) |
when computed across 8 frequency bands & 6 harmonics, it results in a 3d tensor that represents the relative strength of the tempos present in the audio.
this ‘tempo cube’ can then be fed into a convolutional neural net that performs classification over 256 output classes (30-286bpm).
with this approach, the cnn barely has to do any legwork, and it functions more as a filter to reduce the dimensionality of the hcqm, i.e. instead of learning onset patterns, it just needs to interpret the relative strength of the bpm frequencies themselves.
anyways, I couldn’t find any source code for the paper, nor any hcqm implementations, so I built my own. here’s how it works.
hcqm
hcqm process for a single clip:
- load a song → [duration*sample_rate]
- chop and stack into 8s clips → [clips (duration//8), len_clip (8*sample_rate)]
- compute stft on slices → [clips, fft_bins (1+n_fft/2), fft_frames (len_clip/hop)]
- compress into log-spaced bands → [clips, bands (8), fft_frames]
- flatten into batch of band-signals → [clips*bands, fft_frames]
- compute onset strength of band-signals → [clips*bands, fft_frames]
- compute cqt (per harmonic) → [harmonics (6), clips*bands, cqt_bins (240), cqt_frames (1)]
- reshape → [clips, cqt_bins, bands, harmonics]
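a rough, runnable sketch of the first six steps for a single song (the sample rate, n_fft, hop, band edges, and onset-strength formulation below are stand-ins, not the exact values used):

```python
import math
import torch

sr, n_fft, hop, n_bands = 22050, 2048, 256, 8      # placeholder parameters

audio = torch.randn(60 * sr)                        # load a song: [duration*sample_rate]
audio = audio[: (len(audio) // (8 * sr)) * (8 * sr)]  # drop the ragged tail
clips = audio.reshape(-1, 8 * sr)                   # [clips, 8*sample_rate]

spec = torch.stft(clips, n_fft, hop_length=hop,
                  window=torch.hann_window(n_fft),
                  return_complex=True).abs()        # [clips, 1+n_fft/2, fft_frames]

# compress the fft bins into log-spaced bands by averaging between band edges
edges = torch.logspace(0, math.log10(n_fft // 2), n_bands + 1).long().tolist()
bands = torch.stack([spec[:, lo:hi].mean(dim=1)
                     for lo, hi in zip(edges[:-1], edges[1:])], dim=1)  # [clips, bands, fft_frames]

band_signals = bands.reshape(-1, bands.shape[-1])   # [clips*bands, fft_frames]

# onset strength: half-wave-rectified frame-to-frame difference of the log energy
onset = torch.relu(torch.diff(torch.log1p(band_signals), dim=-1))  # [clips*bands, fft_frames-1]

print(onset.shape)
# the real pipeline then runs a cqt over tempo frequencies (~0.5-4.76 Hz) on each
# band signal for 6 harmonics (via nnAudio) and reshapes the result to
# [clips, cqt_bins, bands, harmonics]
```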
pre-process
initially, I used librosa to compute the hcqms (as mentioned in the paper), but I needed to compute 20-40 clips per song, for several thousand songs, which would've taken days.
so I rewrote the hcqm implementation with pytorch and nnAudio to run on the gpu instead.
the kernels / filters for each step are pre-computed and re-used, and with some careful flattening & reshaping it was possible to batch-process 50-100 songs at once, in less than a second (on a 4090).
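as a toy illustration of that batching pattern (random filters standing in for the real pre-computed hcqm kernels):

```python
import torch

# build the filter matrix once, keep it on the gpu, and push the stacked
# band-signals from many songs through a single call
device = "cuda" if torch.cuda.is_available() else "cpu"
n_frames, cqt_bins = 690, 240
filters = torch.randn(cqt_bins, n_frames, device=device)     # computed once, re-used

songs = [torch.randn(30 * 8, n_frames) for _ in range(100)]  # 100 songs, [clips*bands, frames] each
batch = torch.cat(songs).to(device)                          # [total_clips*bands, frames]
tempo_energies = batch @ filters.T                           # one gpu matmul for the whole batch
print(tempo_energies.shape)                                  # torch.Size([24000, 240])
```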
architecture
cnn architecture, from Foroughmand & Peeters
the architecture is the same as in the original paper: a pretty straightforward cnn.
- conv1
- input channels: 6
- output channels: 128
- kernel size: (4, 6)
- followed by batchnorm2d and relu activation
- conv2
- input channels: 128
- output channels: 64
- kernel size: (4, 6)
- followed by batchnorm2d and relu activation
- conv3
- input channels: 64
- output channels: 64
- kernel size: (4, 6)
- followed by batchnorm2d and relu activation
- conv4
- input channels: 64
- output channels: 32
- kernel size: (4, 6)
- followed by batchnorm2d and relu activation
- conv5
- input channels: 32
- output channels: 8
- kernel size: (120, 6)
- followed by batchnorm2d and relu activation
- fc1
- input features: 2904
- output features: 256
- followed by elu activation and dropout (0.5)
- fc2 (output layer)
- input features: 256
- output features: num_classes (default 256)
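the layer list above maps almost line-for-line onto a pytorch module. a sketch of that stack; the padding scheme is an assumption ('same' on conv1-conv4, none on conv5), chosen because it makes fc1's 2904-feature input (8 × 121 × 3) work out:

```python
import torch
import torch.nn as nn

class DeepRhythmCNN(nn.Module):
    # sketch of the stack above; 'same' padding on conv1-conv4 and an unpadded
    # conv5 are assumptions that reproduce the 2904 = 8 * 121 * 3 fc1 input
    def __init__(self, num_classes=256):
        super().__init__()
        def block(c_in, c_out, kernel, padding):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel, padding=padding),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
            )
        self.conv1 = block(6, 128, (4, 6), "same")
        self.conv2 = block(128, 64, (4, 6), "same")
        self.conv3 = block(64, 64, (4, 6), "same")
        self.conv4 = block(64, 32, (4, 6), "same")
        self.conv5 = block(32, 8, (120, 6), 0)          # 240x8 -> 121x3, no padding
        self.fc1 = nn.Sequential(nn.Linear(2904, 256), nn.ELU(), nn.Dropout(0.5))
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        # x: [batch, harmonics (6), cqt_bins (240), bands (8)]
        x = self.conv5(self.conv4(self.conv3(self.conv2(self.conv1(x)))))
        x = x.flatten(1)                                # [batch, 2904]
        return self.fc2(self.fc1(x))                    # logits over bpm classes

# quick shape check with a dummy batch of hcqm 'tempo cubes'
print(DeepRhythmCNN()(torch.randn(4, 6, 240, 8)).shape)  # torch.Size([4, 256])
```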
data
for training data, I used giantsteps, ballroom, slakh2100, and a small subset of fma.
I ran the whole set through every other tempo predictor I could find, chose a ‘true bpm’ by majority vote, then measured confidence as avg. distance between each predictor and the ‘winner’.
this allowed me to significantly clean up the dataset by removing multi-tempo / ambiguous / acoustic songs that cannot be predicted accurately.
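roughly, the vote + distance step looks like this (the agreement tolerance and tie-breaking here are placeholders, not the exact rules used):

```python
def vote_and_distance(preds, tol=0.02):
    # preds: {predictor_name: estimated_bpm} for one song. predictions within
    # +/- tol of a candidate count as agreeing with it; the candidate with the
    # most agreement wins, and the avg. distance from the winner is used as the
    # confidence measure. the tolerance value is an assumption.
    bpms = list(preds.values())
    support = {b: sum(abs(other - b) <= tol * b for other in bpms) for b in bpms}
    winner = max(support, key=support.get)
    avg_distance = sum(abs(b - winner) for b in bpms) / len(bpms)
    return winner, avg_distance

print(vote_and_distance({"tempocnn": 128.0, "essentia": 127.8, "librosa": 96.1}))
# -> 128.0 wins, avg distance ~10.7
```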
training
the core setup was:
- batches of 256 (8 second clips)
- CrossEntropyLoss
- Adam optimizer
- learning rate 1e-5
- halves lr after 2 epochs no improvement
- quits after 5 epochs no improvement
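strung together, that setup looks roughly like this (random tensors stand in for real hcqm clips and labels; DeepRhythmCNN is the sketch from the architecture section):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
model = DeepRhythmCNN().to(device)

# dummy stand-ins for the real hcqm dataset: [n, 6, 240, 8] cubes + bpm classes
dummy = TensorDataset(torch.randn(1024, 6, 240, 8), torch.randint(0, 256, (1024,)))
train_loader = DataLoader(dummy, batch_size=256, shuffle=True)
val_loader = DataLoader(dummy, batch_size=256)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=2)

best_loss, stale = float("inf"), 0
for epoch in range(100):
    model.train()
    for hcqm, bpm_class in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(hcqm.to(device)), bpm_class.to(device))
        loss.backward()
        optimizer.step()

    # validation loss drives the lr schedule and early stopping
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                       for x, y in val_loader) / len(val_loader)
    scheduler.step(val_loss)          # halves the lr after 2 epochs with no improvement
    if val_loss < best_loss:
        best_loss, stale = val_loss, 0
    else:
        stale += 1
        if stale >= 5:                # quits after 5 epochs with no improvement
            break
```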
evaluation
method | acc1 (%) | acc2 (%) | avg. time (s) | total time (s) |
---|---|---|---|---|
deeprhythm (cuda) | 95.91 | 96.54 | 0.021 | 20.11 |
deeprhythm (cpu) | 95.91 | 96.54 | 0.12 | 115.02 |
tempocnn (cnn) | 84.78 | 97.69 | 1.21 | 1150.43 |
tempocnn (fcn) | 83.53 | 96.54 | 1.19 | 1131.51 |
essentia (multifeature) | 87.93 | 97.48 | 2.72 | 2595.64 |
essentia (percival) | 85.83 | 95.07 | 1.35 | 1289.62 |
essentia (degara) | 86.46 | 97.17 | 1.38 | 1310.69 |
librosa | 66.84 | 75.13 | 0.48 | 460.52 |
- test done on 953 songs, mostly electronic, hip hop, pop, and rock
- acc1 = prediction within +/- 2% of actual bpm
- acc2 = prediction within +/- 2% of actual bpm or a multiple (e.g. 120 ~= 60)
- timed from filepath in to bpm out (audio loading, feature extraction, model inference)
- I could only get tempocnn to run on cpu (it requires cuda 10)
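for reference, the two accuracy checks reduce to something like this (the exact set of multiples counted for acc2 is an assumption; 2x/0.5x octave errors are the common ones):

```python
def acc1(pred, true, tol=0.02):
    # within +/- 2% of the annotated bpm
    return abs(pred - true) <= tol * true

def acc2(pred, true, tol=0.02, factors=(1/3, 1/2, 1, 2, 3)):
    # within +/- 2% of the annotated bpm or one of its multiples/fractions
    # (the exact factor set is an assumption)
    return any(acc1(pred, f * true, tol) for f in factors)

print(acc1(116, 120), acc2(60, 120))   # False True
```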