phasefinder
[figure: phase prediction per best epoch]
phasefinder is a beat estimation model that predicts metric position as rotational phase, heavily inspired by this paper (Oyama et al.).
demos
- `are you looking up` by mk.gee
- `just the way it is` by action bronson
- `lethal weapon` by mike & tony seltzer
- `michelle` by the beatles
why phase?
many beat estimation methods use binary classification (0,1) to detect ‘beat presence probability’.
this leaves the model without much context on the cyclical structure of music, i.e. it’s learning to flip a lever up & down at precise times.
in contrast, by estimating phase (the angle 0-360 where 0=beat, 360=next beat), every frame has meaningful information, e.g.
binary | 0 | 0 | 0 | 1 | 0 | 0 | ... |
---|---|---|---|---|---|---|---|
phase | 312 | 328 | 344 | 0 | 16 | 32 | ... |
instead of tracking onsets or learning the exact number of 0s between each 1, it only needs to look back a few frames to get an idea of the location and speed of the phase.
rather than making discrete choices, it only needs to trace a continuous spiral through time.
architecture
input
- audio, sampled at 22.1kHz
- spectrogram with n_fft=2048 and hop=512
- apply a log filterbank, reducing the frequency dimension from the original 1025 bins to 81 log-spaced bands (sketched below)
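to make the input stage concrete, here's a minimal sketch, assuming `librosa`, a 22050 Hz sample rate, and a mel-style filterbank standing in for whatever log filterbank the real pipeline uses:

```python
import numpy as np
import librosa

def log_spectrogram(path, sr=22050, n_fft=2048, hop=512, n_bands=81):
    y, _ = librosa.load(path, sr=sr, mono=True)
    # magnitude spectrogram: [1025, num_frames]
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    # collapse 1025 linear bins onto 81 log-spaced bands
    fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_bands)
    # log-compress the banded magnitudes
    return np.log1p(fb @ spec)          # [81, num_frames]
```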
feature extraction
- three 1d convolutional layers, each followed by max pooling, ELU activation, and dropout
- starts with `num_bands` (81) input channels and maintains `num_channels` (36) throughout
- this setup reduces the frequency dimension while preserving temporal information
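a rough pytorch sketch of this block; the kernel sizes, pool settings, and dropout rate are my guesses, not the trained config:

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, num_bands=81, num_channels=36, dropout=0.1):
        super().__init__()
        layers, in_ch = [], num_bands
        for _ in range(3):
            layers += [
                nn.Conv1d(in_ch, num_channels, kernel_size=3, padding=1),
                nn.MaxPool1d(kernel_size=3, stride=1, padding=1),  # stride 1 keeps the frame count
                nn.ELU(),
                nn.Dropout(dropout),
            ]
            in_ch = num_channels
        self.net = nn.Sequential(*layers)

    def forward(self, x):        # x: [batch, 81, num_frames]
        return self.net(x)       # -> [batch, 36, num_frames]
```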
tcn
- temporal convolutional network
- consists of `num_tcn_layers` (16) dilated causal convolutions
- uses skip connections and layer normalization
- gradually increases the receptive field, allowing the model to capture long-range dependencies
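roughly, one layer of the stack might look like this; the kernel size, dilation schedule, and exactly where the norm and skip sit are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class TCNLayer(nn.Module):
    def __init__(self, channels=36, kernel_size=5, dilation=1, dropout=0.1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left-pad so the conv only sees the past
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.norm = nn.LayerNorm(channels)
        self.act, self.drop = nn.ELU(), nn.Dropout(dropout)

    def forward(self, x):                                # x: [batch, channels, frames]
        y = self.conv(F.pad(x, (self.pad, 0)))
        y = self.norm(y.transpose(1, 2)).transpose(1, 2)
        return x + self.drop(self.act(y))                # skip connection

# doubling dilations grow the receptive field exponentially (the wrap at 2^7 is my choice)
tcn = nn.Sequential(*[TCNLayer(dilation=2 ** (i % 8)) for i in range(16)])
```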
attention
turns out this didn’t really do anything, so I took it out (details)
- implements multi-head attention with `num_heads` (4)
- includes separate linear layers for query, key, and value projections
- incorporates positional encoding to give the model a sense of sequence order
decoder
- two dense layers with dropout in between
- expands from `num_tcn_outputs` to 72, then to `num_classes` (360)
- uses ELU activation and ends with a LogSoftmax for phase classification
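a quick sketch of the head; `num_tcn_outputs` is 36 per the channel flow in the differences section below, and the dropout rate is made up:

```python
import torch.nn as nn

decoder = nn.Sequential(
    nn.Linear(36, 72),          # num_tcn_outputs -> middle width
    nn.ELU(),
    nn.Dropout(0.1),
    nn.Linear(72, 360),         # -> num_classes
    nn.LogSoftmax(dim=-1),      # per-frame log-probabilities over 360 phase angles
)
```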
this architecture allows the model to:
- extract relevant features from the input spectrogram
- capture long-term dependencies in the music
- classify each frame into one of 360 phase angles
hmm
the hidden markov model (hmm) is used to decode the phase space output into a single ‘phase trajectory’.
its transition probabilities are calculated using the bpm (beats per minute) and confidence estimates from deeprhythm
- state space consists of $N$ discrete phase values, where $N$ is the number of frames in the song. each state represents a phase angle in the range [0°, 360°].
- observation sequence is the phase prediction output from the previous stage.
- transition probabilities are calculated based on the estimated bpm and frame rate. the equation for calculating the expected phase difference is:
$$ \text{expected_phase_diff}_{i,j} = \left((i \cdot \frac{360}{N} + \frac{\Delta\text{phase}}{\text{frame}}) \bmod 360\right) - j \cdot \frac{360}{N} $$
where $i$ and $j$ are the indices of the current and next states, and:
$$ \frac{\Delta\text{phase}}{\text{frame}} = \frac{360 \cdot \text{bpm}}{60 \cdot \text{frame_rate}} $$
- transition probability matrix is constructed as follows:
$$ A_{i,j} = \begin{cases} 1 - \frac{\text{expected_phase_diff}_{i,j}}{\text{distance_threshold}} & \text{if } \text{expected_phase_diff}_{i,j} \leq \text{distance_threshold} \\ 10^{-10} & \text{otherwise} \end{cases} $$
where $\text{distance_threshold} = 0.1 \cdot \frac{\Delta\text{phase}}{\text{frame}}$
- bpm confidence is used to adjust the transition probabilities:
$$ A'_{i,j} = \text{bpm_confidence} \cdot A_{i,j} + (1 - \text{bpm_confidence}) \cdot \frac{1}{N} $$
- viterbi algorithm is used to find the most likely sequence of states (beat positions) given the observations. the recursive step of the viterbi algorithm is:
$$ v_t(j) = \max_i \left\{ v_{t-1}(i) + \log A'_{i,j} + \log e_t(j) \right\} $$
where $v_t(j)$ is the viterbi probability for state $j$ at time $t$, $A'_{i,j}$ is the transition probability from state $i$ to state $j$, and $e_t(j)$ is the emission probability for state $j$ at time $t$.
- backtracking: after computing the viterbi probabilities, the algorithm backtracks to find the most likely sequence of states, which corresponds to the estimated beat positions.
this allows for a robust, continuous estimation of beat positions by incorporating both phase predictions and global tempo information (bpm) while accounting for the uncertainty in the bpm estimate.
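to make the decode concrete, here's a numpy sketch that follows the equations above; it treats $N$ as the number of discretized phase states (the model's 360 output bins), uses a wrapped absolute distance for the expected phase difference, and all variable names are mine:

```python
import numpy as np

def decode_phase(log_probs, bpm, bpm_confidence, frame_rate):
    # log_probs: [num_frames, N] log-probabilities over N phase states (the model output)
    num_frames, N = log_probs.shape
    phase_per_frame = 360.0 * bpm / (60.0 * frame_rate)   # expected phase advance per frame
    distance_threshold = 0.1 * phase_per_frame

    states = np.arange(N) * 360.0 / N
    # expected phase difference between where state i should land next frame and state j
    predicted = (states[:, None] + phase_per_frame) % 360.0
    diff = np.abs(predicted - states[None, :])
    diff = np.minimum(diff, 360.0 - diff)                  # wrapped distance (assumption)

    A = np.where(diff <= distance_threshold, 1.0 - diff / distance_threshold, 1e-10)
    A = bpm_confidence * A + (1.0 - bpm_confidence) / N    # blend with uniform transitions
    log_A = np.log(A)

    # viterbi recursion in log space
    v = log_probs[0].copy()
    back = np.zeros((num_frames, N), dtype=int)
    for t in range(1, num_frames):
        scores = v[:, None] + log_A                        # scores[i, j]
        back[t] = np.argmax(scores, axis=0)
        v = scores[back[t], np.arange(N)] + log_probs[t]

    # backtrack the most likely phase trajectory
    path = np.zeros(num_frames, dtype=int)
    path[-1] = int(np.argmax(v))
    for t in range(num_frames - 2, -1, -1):
        path[t] = back[t + 1][path[t + 1]]
    return path * 360.0 / N                                # decoded phase angle per frame
```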
postprocessing
after the hmm, we are left with a sequence of phase predictions (0-360) with shape [seq_len]
to get the timestamps, we:
- compute the onset of the phase
  - highlights frames where it jumps from ~360° -> ~0° (aka where a beat occurs)
- choose `beat_frames` as all frames where `onset` is greater than 300
- convert from `fft_frame_idx` to time in seconds: `frame * hop / sample_rate`
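a minimal sketch of that conversion; the threshold and diff direction follow the description above, and the sample rate is assumed to be 22050 Hz:

```python
import numpy as np

def phase_to_times(phase, hop=512, sample_rate=22050, threshold=300):
    # phase: [seq_len] decoded phase angles in [0, 360)
    onset = phase[:-1] - phase[1:]            # a large positive drop = wrap from ~360° back to ~0°
    beat_frames = np.where(onset > threshold)[0] + 1
    return beat_frames * hop / sample_rate    # fft frame index -> seconds
```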
this leaves us with a pretty solid list of times, but there are usually a few small mistakes, e.g. an extra ‘eighth note’ beat or a couple missing beats.
to clean things up even further, I’ve been testing various methods.
the current process is:
- find the interval mode:
  - most common time difference between beats
  - it’s a good measure of the overall tempo
- clean up extra beats:
  - remove beats where the surrounding beats are within a threshold of the interval mode, suggesting that the beat in between shouldn’t be there
- correct the beat sequence (sketched below):
  - looks at each beat and decides what to do based on how far it is from the interval mode:
    - if it’s too close to the last beat (overlap), skip it
    - if it’s a bit early, nudge it forward (by `0.5*interval_mode`)
    - if it’s about right, keep it as is
    - if it’s a bit late, nudge it back (by `0.5*interval_mode`)
    - if it’s way too late, assume we missed a beat (or more) and add them in
importantly, each beat is compared to the last beat added to the result list, not the original input. so if the last beat was moved, the next one is compared to the new time
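a rough sketch of the interval mode + correction pass; the break points (expressed as fractions of the interval mode) and the rounding precision are placeholders, not the tuned values from the grid search below:

```python
import numpy as np

def interval_mode(beats, precision=2):
    # most common (rounded) time difference between consecutive beats
    intervals = np.round(np.diff(beats), precision)
    values, counts = np.unique(intervals, return_counts=True)
    return values[np.argmax(counts)]

def correct_beats(beats, mode, overlap=0.25, early=0.75, late=1.25, missed=1.75):
    result = [beats[0]]
    for b in beats[1:]:
        gap = b - result[-1]              # compare to the last beat *added*, not the original input
        if gap < overlap * mode:          # too close to the previous beat: skip it
            continue
        elif gap < early * mode:          # a bit early: nudge it forward
            result.append(b + 0.5 * mode)
        elif gap <= late * mode:          # about right: keep it
            result.append(b)
        elif gap <= missed * mode:        # a bit late: nudge it back
            result.append(b - 0.5 * mode)
        else:                             # way too late: fill in the missed beat(s), then keep it
            for _ in range(int(round(gap / mode)) - 1):
                result.append(result[-1] + mode)
            result.append(b)
    return np.array(result)
```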
differences
overall, this model is very similar to the architecture described by Oyama et al above, but there are a few changes:
- increased `num_channels` (in feature extraction / tcn)
  - 20 => 36
  - seemed more aligned with the eventual 360 output classes
- increased `num_tcn_layers`
  - 11 => 16
  - better vibes idk
- increased decoder middle channel width
  - 64 => 72
  - even multiples felt right
  - now the frequency dimension goes 81 => `[feature]` => 36 => `[tcn]` => 36 => `[dec1]` => 72 => `[dec2]` => 360
- hmm built from scratch, much simpler than the dbn used in the paper
  - dbn
    - uses 2d state space representing measure position and tempo
    - separate transition models for bar transition and tempo transition
    - can model beats and downbeats using measure position
  - hmm
    - uses 1d state space representing only beat phase
    - single phase transition model, with implicitly encoded tempo that does not change
    - only models beat phase
data
the model is trained with a 30k subset of the lakh midi dataset, and I generated all the audio as part of superslakh
the beat times are determined using the midi source file (via `pretty_midi`)
these times are used to determine the angle [0-360] per fft window frame, which is then converted to a [num_frame, 360] target space
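a sketch of that conversion, interpolating 0°→360° between consecutive beats; the frame timing and edge handling are assumptions:

```python
import numpy as np

def beats_to_phase(beat_times, num_frames, hop=512, sample_rate=22050):
    frame_times = np.arange(num_frames) * hop / sample_rate
    # for each frame, find the surrounding pair of beats and interpolate 0 -> 360 between them
    idx = np.clip(np.searchsorted(beat_times, frame_times, side="right") - 1,
                  0, len(beat_times) - 2)
    prev_beat, next_beat = beat_times[idx], beat_times[idx + 1]
    phase = 360.0 * (frame_times - prev_beat) / (next_beat - prev_beat)
    return np.clip(phase, 0.0, 359.999)   # one angle per frame; expanded to the [num_frames, 360] target space
```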
wrapped, blurry one-hots
in training, the ‘target’ is a blurred one-hot of length 360, where:
- $\mathbf{v}$ is the original one-hot vector of length $n$
- $w$ is the phase width (an odd number)
- $i$ is the index of the non-zero element in the original one-hot vector
- $j$ is the current index in the blurred vector
the blurred vector $\mathbf{b}$ can be defined as:
$$ b_j = \max\left(0,\ 1 - \frac{2\,d(i, j)}{w + 1}\right), \qquad d(i, j) = \min\left(|j - i|,\ n - |j - i|\right) $$
where $j \in \{0, 1, \ldots, n-1\}$ and $d(i, j)$ is the circular distance between $j$ and $i$
e.g. for `phase=1`, the initial one-hot is
0 | 1 | 2 | 3 | 4 | 5 | … | 358 | 359 |
---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | … | 0 | 0 |
once blurred with a `phase_width` of 7, it becomes
0 | 1 | 2 | 3 | 4 | 5 | … | 358 | 359 |
---|---|---|---|---|---|---|---|---|
0.75 | 1 | 0.75 | 0.5 | 0.25 | 0 | … | 0.25 | 0.5 |
this reinforces the circular nature of the target space,
that 0 & 360 are not opposite but adjacent.
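a minimal sketch of the blur, using the circular distance so that it wraps; it reproduces the width-7 example above, though the exact ramp used in training may differ:

```python
import numpy as np

def blurred_one_hot(i, n=360, width=7):
    j = np.arange(n)
    d = np.abs(j - i)
    d = np.minimum(d, n - d)                       # circular distance: 0 and 359 are neighbours
    return np.maximum(0.0, 1.0 - 2.0 * d / (width + 1))

target = blurred_one_hot(1)                        # target[0]=0.75, target[1]=1.0, target[359]=0.5
```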
training
the training for this model has taken a meandering path, and I’ve tried all sorts of tweaks and variations.
the basic setup for each run is:
- batch size of 1
  - each item is a full song, and each song is a different length
  - rather than trying to pad/stack/unpad, I decided to keep it simple and just train on one at a time
- `KLDiv` loss
  - batchmean reduction
- `Adam` optimizer
  - linear LR warmup of 5 epochs (20% -> full)
  - reduce LR by half after 5 epochs w/o improvement
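roughly, the setup looks like this; `model`, `dataset`, `validate`, `base_lr`, and `num_epochs` are placeholders, not project code:

```python
import torch
import torch.nn as nn

def train(model, dataset, validate, base_lr, num_epochs):
    criterion = nn.KLDivLoss(reduction="batchmean")   # expects log-probs vs. a target distribution
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.2, total_iters=5)
    plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)

    for epoch in range(num_epochs):
        for spec, target in dataset:                  # batch size 1: one full song at a time
            optimizer.zero_grad()
            log_probs = model(spec)                   # [num_frames, 360] log-probabilities
            loss = criterion(log_probs, target)       # target: blurred one-hot per frame
            loss.backward()
            optimizer.step()
        warmup.step()                                 # linear warmup: 20% -> full lr over 5 epochs
        plateau.step(validate(model))                 # halve lr after 5 epochs w/o improvement
    return model
```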
evaluation (in progress)
once the overall structure was established and I knew that it worked, I started running tests to determine the optimal parameters.
phase width
first, I ran 5 independent runs with various phase widths.
parameters:
- max 20 epochs
- learning rate 1e-3 (with warmup)
target width | 3 | 5 | 7 | 9 | 11 |
---|---|---|---|---|---|
best epoch | 17 | 11 | 13 | 17 | 10 |
f-measure | 0.83 | 0.827 | 0.839 | 0.833 | 0.832 |
cmlt | 0.8 | 0.796 | 0.807 | 0.804 | 0.798 |
amlt | 0.872 | 0.862 | 0.874 | 0.871 | 0.86 |
as shown, the pulse width had a clear effect on the magnitude of loss (bigger target = smaller loss), but it didn’t seem to have much of an effect on the predictive accuracy of the model.
attention
next, I tried removing attention.
parameters:
- epochs 0-19
- learning rate 1e-3 (with warmup)
- pulse width 7
- epochs 20-39
- learning rate 5e-4 (no warmup)
- pulse width 7
 | attention | no attention |
---|---|---|
best epoch | 39 | 32 |
f-measure | 0.848 | 0.855 |
cmlt | 0.82 | 0.819 |
amlt | 0.878 | 0.879 |
these results were surprising, as I thought attention would have a larger effect on the accuracy.
we can see that while the attention helped slightly with validation loss in the latter half of the test, it had little effect on the accuracy of the model (actually performing slightly worse overall).
my current theory on this is that the problem just isn’t hard enough to benefit from the added complexity, i.e. the audio data is not nuanced enough to contain ‘key words’, so the attention module learns to leave the tcn output mostly unchanged (don’t feel like testing this)
on one hand, I was a little disappointed because it seemed like a good idea, and initial tests seemed promising.
on the other, the attention module added quite a bit of heft, and in its absence, the model uses far less memory.
(training w/ attention: ~20gb vram, 15it/s; training w/o attention: ~1gb vram, 25it/s)
learning rate
on both previous tests, I noticed that the accuracy metrics tended to jump around a lot.
my intuition was that the learning rate (1e-3) was too high, forcing it to take big steps that over/undershot the target.
so, I set up a few runs to test various rates (1e-5, 5e-5, 1e-4, 5e-4, 1e-3)
note: the original paper used 1e-3
parameters:
- 60 epochs
- pulse width 7
lr | 1e-5 | 5e-5 | 1e-4 | 5e-4 | 1e-3* |
---|---|---|---|---|---|
best epoch | 52 | 37 | 51 | 40 | 10 |
f-measure | 0.807 | 0.853 | 0.849 | 0.854 | 0.837 |
cmlt | 0.772 | 0.821 | 0.819 | 0.824 | 0.802 |
amlt | 0.853 | 0.874 | 0.880 | 0.875 | 0.850 |
* in progress
evidently, the learning rate does not matter as much as I thought.
it looks like most runs converged pretty quickly on the same general trend (except for `1e-5`, which seemingly hit a wall at about f=~0.8)
interestingly, `1e-4` and `5e-4` appear to be tracing the exact same peaks and valleys in the loss chart from epochs 45-60
`5e-4` looks like a good balance between stable and explorative, so I'll be using that going forward.
postproc
should’ve done this one way sooner, but I tested putting the ‘phase targets’ through the hmm/cleaner and it scored 0.908.
this means that even if the model outputs the targets perfectly, its ‘accuracy ceiling’ is 0.908 (unless I fix the postproc)
so let’s try to fix the postproc.
first of all, there are a lot of ‘magic number’ parameters, especially in the cleaner functions. so we’ll do a grid search over various options for the parameters:
- hmm bpm confidence
- hmm distance threshold
- clean_beats interval threshold
- interval mode threshold
- correct_beats ‘break points’
  - overlap
  - early
  - late
  - missed
- correct_beats nudge amount
now, with 4 options per parameter, this grid has 262,144 combos, which is too many to check exhaustively.
so instead, I generated each combination of parameters, put them in a list, and shuffled it.
now, I can watch the results as they come in, and get a decent estimate of the ‘shape’ of the data, honing in on the param options as needed.
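a sketch of that shuffled sweep; the option values are placeholders, not the ranges actually tested:

```python
import itertools
import random

def shuffled_grid(param_options):
    # every combination of the options, in random order ≈ a uniform sample of the full grid
    combos = [dict(zip(param_options, values))
              for values in itertools.product(*param_options.values())]
    random.shuffle(combos)
    return combos

param_options = {
    "bpm_confidence":     [0.7, 0.8, 0.9, 0.95],
    "distance_threshold": [0.05, 0.1, 0.15, 0.2],
    # ... one entry (4 options each) per parameter listed above
}

for params in shuffled_grid(param_options):
    # run the postproc with these params and score the f-measure, stopping whenever the trend is clear
    ...
```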
by sampling the random distribution across all variables, then plotting [var] vs f-measure for each parameter, I can approximate the effect of each variable.
thus, it becomes clear which variables have an effect on the overall accuracy (early/late beat threshold), and which do not (bpm confidence)
params | original | optimized |
---|---|---|
f-measure | 0.854 | 0.881 |
cmlt | 0.822 | 0.847 |
amlt | 0.881 | 0.891 |
[currently in progress]