
phasefinder


[animation: phasefinder learning process, phase prediction per best epoch]

phasefinder is a beat estimation model that predicts metric position as rotational phase, heavily inspired by this paper.

demos


• `are you looking up` by mk.gee
• `just the way it is` by action bronson
• `lethal weapon` by mike & tony seltzer
• `michelle` by the beatles

why phase?

many beat estimation methods use binary classification (0,1) to detect ‘beat presence probability’.

this leaves the model without much context on the cyclical structure of music, i.e. it’s learning to flip a lever up & down at precise times.

in contrast, by estimating phase (the angle 0-360 where 0=beat, 360=next beat), every frame has meaningful information, e.g.

|        | frame 1 | frame 2 | frame 3 | frame 4 | frame 5 | frame 6 | … |
|--------|---------|---------|---------|---------|---------|---------|---|
| binary | 0       | 0       | 0       | 1       | 0       | 0       | … |
| phase  | 312     | 328     | 344     | 0       | 16      | 32      | … |

instead of tracking onsets or learning the exact number of 0s between each 1, it only needs to look back a few frames to get an idea of the location and speed of the phase.

rather than making discrete choices, it only needs to trace a continuous spiral through time.
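
as a toy illustration of that last point, two adjacent phase readings are enough to recover the local tempo (the 50 fps frame rate here is an assumed value, not a phasefinder constant):

```python
# toy example: recovering local tempo from two adjacent phase readings.
# the phase values follow the table above; frame_rate is an assumed 50 fps.
frame_rate = 50                          # analysis frames per second (assumed)
phase = [312, 328, 344, 0, 16, 32]       # degrees, one value per frame

step = (phase[1] - phase[0]) % 360       # wrapped phase advance per frame -> 16 deg
frames_per_beat = 360 / step             # 22.5 frames per beat
bpm = 60 * frame_rate / frames_per_beat  # ~133 bpm at 50 fps
print(bpm)
```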

architecture

[architecture diagram]

• input
• feature extraction
• tcn
• attention: turns out this didn’t really do anything, so I took it out (details)
• decoder

this architecture allows the model to:

  1. extract relevant features from the input spectrogram
  2. capture long-term dependencies in the music
  3. classify each frame into one of 360 phase angles
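
a rough pytorch sketch of that stack; layer sizes and module choices here are illustrative guesses, not phasefinder's actual configuration:

```python
import torch
import torch.nn as nn

class PhaseNet(nn.Module):
    """spectrogram -> conv features -> dilated tcn -> per-frame 360-way phase logits."""

    def __init__(self, n_mels=128, channels=64, n_phase_bins=360, n_tcn_layers=8):
        super().__init__()
        # feature extraction: collapse the frequency axis into a channel embedding
        self.frontend = nn.Sequential(
            nn.Conv1d(n_mels, channels, kernel_size=3, padding=1),
            nn.ELU(),
        )
        # tcn: stacked dilated convolutions give each frame a long receptive field
        self.tcn = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3,
                          dilation=2 ** d, padding=2 ** d),
                nn.ELU(),
            )
            for d in range(n_tcn_layers)
        ])
        # decoder: per-frame logits over the 360 phase angles
        self.head = nn.Conv1d(channels, n_phase_bins, kernel_size=1)

    def forward(self, spec):            # spec: [batch, n_mels, num_frames]
        x = self.frontend(spec)
        x = self.tcn(x)
        return self.head(x)             # [batch, 360, num_frames] phase logits
```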

hmm

the hidden markov model (hmm) is used to decode the phase space output into a single ‘phase trajectory’.

its transition probabilities are calculated using the bpm (beats per minute) and confidence estimates from deeprhythm:

  1. state space consists of $N$ discrete phase values, where $N$ is the number of frames in the song.
    each state represents a phase angle in the range [0°, 360°].

  2. observation sequence is the phase prediction output from the previous stage.

  3. transition probabilities are calculated based on the estimated bpm and frame rate. the equation for calculating the expected phase difference is:

    $$ \text{expected_phase_diff}_{i,j} = \left((i \cdot \frac{360}{N} + \frac{\Delta\text{phase}}{\text{frame}}) \bmod 360\right) - j \cdot \frac{360}{N} $$

    where $i$ and $j$ are the indices of the current and next states, and:

    $$ \frac{\Delta\text{phase}}{\text{frame}} = \frac{360 \cdot \text{bpm}}{60 \cdot \text{frame_rate}} $$

  4. transition probability matrix is constructed as follows:

    $$ A_{i,j} = \begin{cases} 1 - \frac{\text{expected_phase_diff}_{i,j}}{\text{distance_threshold}} & \text{if } \text{expected_phase_diff}_{i,j} \leq \text{distance_threshold} \\ 10^{-10} & \text{otherwise} \end{cases} $$

    where $\text{distance_threshold} = 0.1 \cdot \frac{\Delta\text{phase}}{\text{frame}}$

  5. bpm confidence is used to adjust the transition probabilities:

    $$ A' _{i,j} = \text{bpm_confidence} \cdot A _{i,j} + (1 - \text{bpm_confidence}) \cdot \frac{1}{N} $$

  6. viterbi algorithm is used to find the most likely sequence of states (beat positions) given the observations. the recursive step of the viterbi algorithm is:

    $$ v_t(j) = \max_i {v_{t-1}(i) + \log A'_{i,j} + \log e_t(j)} $$

    where $v_t(j)$ is the viterbi probability for state $j$ at time $t$, $A'_{i,j}$ is the transition probability from state $i$ to state $j$, and $e_t(j)$ is the emission probability for state $j$ at time $t$.

  7. backtracking: after computing the viterbi probabilities, the algorithm backtracks to find the most likely sequence of states, which corresponds to the estimated beat positions.

this allows for a robust, continuous estimation of beat positions by incorporating both phase predictions and global tempo information (bpm) while accounting for the uncertainty in the bpm estimate.
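
to make the decoding concrete, here's a minimal numpy sketch of the transition matrix and log-domain viterbi described above; the function names and the `phase_probs` / `frame_rate` inputs are assumptions, not phasefinder's actual api:

```python
import numpy as np

def build_transition_matrix(n_states, bpm, frame_rate, bpm_confidence):
    """transition matrix A' over discrete phase bins (degrees)."""
    deg_per_state = 360.0 / n_states
    deg_per_frame = 360.0 * bpm / (60.0 * frame_rate)          # expected phase advance per frame
    thresh = 0.1 * deg_per_frame                                # distance_threshold

    phase_i = np.arange(n_states)[:, None] * deg_per_state     # phase of current state i
    phase_j = np.arange(n_states)[None, :] * deg_per_state     # phase of next state j
    diff = (phase_i + deg_per_frame) % 360.0 - phase_j         # expected_phase_diff
    dist = np.minimum(np.abs(diff), 360.0 - np.abs(diff))      # wrap to circular distance

    # linear falloff inside the threshold, tiny floor (1e-10) outside
    A = np.clip(1.0 - dist / thresh, 1e-10, None)
    # blend with a uniform matrix according to how much we trust the bpm estimate
    return bpm_confidence * A + (1.0 - bpm_confidence) / n_states

def viterbi_decode(phase_probs, A):
    """most likely phase trajectory given per-frame phase probabilities [num_frames, n_states]."""
    log_A = np.log(A)
    log_e = np.log(phase_probs + 1e-12)      # emission log-probs
    T, N = log_e.shape

    v = np.empty((T, N))
    back = np.empty((T, N), dtype=int)
    v[0] = log_e[0]
    for t in range(1, T):
        scores = v[t - 1][:, None] + log_A   # [from_state, to_state]
        back[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0) + log_e[t]

    # backtrack to recover the state (phase bin) sequence
    path = np.empty(T, dtype=int)
    path[-1] = v[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path * (360.0 / N)                # phase bins -> degrees
```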

postprocessing

after the hmm, we are left with a sequence of phase predictions (0-360) with shape [seq_len]

to get the timestamps, we:

  1. compute the onset of the phase
    • highlights frames where it jumps from ~360° -> ~0° (aka where a beat occurs)
  2. choose beat_frames as all frames where onset is greater than 300
  3. convert from fft_frame_idx to time in seconds
    • frame * hop / sample_rate
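
a minimal numpy sketch of those three steps; the hop size and sample rate defaults are assumed values:

```python
import numpy as np

def phase_to_beat_times(phase, hop=441, sample_rate=44100, onset_threshold=300):
    """phase sequence [seq_len] in degrees -> beat times in seconds."""
    phase = np.asarray(phase, dtype=float)
    onset = -np.diff(phase)                       # spikes to ~+360 where phase wraps 360 -> 0
    beat_frames = np.where(onset > onset_threshold)[0] + 1
    return beat_frames * hop / sample_rate        # fft frame index -> seconds
```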

this leaves us with a pretty solid list of times, but there are usually a few small mistakes, e.g. an extra ‘eighth note’ beat or a couple missing beats.

to clean things up even further, I’ve been testing various methods.

the current process is:

  1. find the interval mode:
    • most common time difference between beats
    • it’s a good measure of the overall tempo
  2. clean up extra beats:
    • remove beats where the surrounding beats are within a threshold of the interval mode, suggesting that the beat in between shouldn’t be there
  3. correct the beat sequence:
    • looks at each beat and decides what to do based on how far it is from the interval mode:
      • if it’s too close to the last beat (overlap), skip it
      • if it’s a bit early, nudge it forward (by 0.5*interval_mode)
      • if it’s about right, keep it as is
      • if it’s a bit late, nudge it back (by 0.5*interval_mode)
      • if it’s way too late, assume we missed a beat (or more) and add them in

importantly, each beat is compared to the last beat added to the result list, not the original input. so if the last beat was moved, the next one is compared to the new time
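
a rough sketch of this cleanup pass; the fractional thresholds (0.3 / 0.7 / 1.3 / 1.7 of the interval mode) and the tolerance in `remove_extras` are placeholder guesses, not the tuned values used in phasefinder:

```python
import numpy as np

def interval_mode(beats, precision=2):
    """most common (rounded) inter-beat interval, a proxy for the overall tempo."""
    intervals = np.round(np.diff(beats), precision)
    values, counts = np.unique(intervals, return_counts=True)
    return values[counts.argmax()]

def remove_extras(beats, mode, tol=0.1):
    """drop a beat when its two neighbours are already ~one interval apart."""
    kept = [beats[0]]
    for prev, cur, nxt in zip(beats, beats[1:], beats[2:]):
        if abs((nxt - prev) - mode) < tol * mode:
            continue                            # cur sits inside a normal interval -> extra beat
        kept.append(cur)
    kept.append(beats[-1])
    return kept

def correct_beats(beats, mode):
    """walk the sequence, always comparing to the last beat added to the output."""
    out = [beats[0]]
    for b in beats[1:]:
        gap = b - out[-1]
        if gap < 0.3 * mode:                    # overlap: skip the extra beat
            continue
        elif gap < 0.7 * mode:                  # a bit early: nudge forward by half an interval
            out.append(b + 0.5 * mode)
        elif gap < 1.3 * mode:                  # about right: keep as is
            out.append(b)
        elif gap < 1.7 * mode:                  # a bit late: nudge back by half an interval
            out.append(b - 0.5 * mode)
        else:                                   # way too late: fill in the missed beat(s)
            missed = int(round(gap / mode)) - 1
            out.extend(out[-1] + mode * (k + 1) for k in range(missed))
            out.append(b)
    return out
```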

differences

overall, this model is very similar to the architecture described by Oyama et al. above, but there are a few changes:

data

the model is trained with a 30k subset of the lakh midi dataset, and I generated all the audio as part of superslakh

the beat times are determined from the midi source files (via pretty_midi)

these times are used to determine the angle [0-360] per fft window frame, which is then converted to a [num_frame, 360] target space
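
a minimal sketch of that conversion; the hop size and sample rate are assumed, not superslakh's exact settings:

```python
import numpy as np

def beats_to_phase(beat_times, num_frames, hop=441, sample_rate=44100):
    """per-frame phase angle in [0, 360) from a list of beat times (seconds)."""
    beat_times = np.asarray(beat_times, dtype=float)
    frame_times = np.arange(num_frames) * hop / sample_rate
    # index of the most recent beat at or before each frame
    idx = np.searchsorted(beat_times, frame_times, side="right") - 1
    idx = np.clip(idx, 0, len(beat_times) - 2)
    prev, nxt = beat_times[idx], beat_times[idx + 1]
    # fraction of the way from the previous beat to the next, scaled to degrees
    phase = 360.0 * (frame_times - prev) / (nxt - prev)
    return np.clip(phase, 0.0, 360.0 - 1e-6)   # one angle per fft frame
```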

wrapped, blurry one-hots

in training, the ‘target’ is a blurred one-hot of length 360.

the blurred vector $\mathbf{b}$ can be defined as:

$$ b_j = \max\left(0,\ 1 - \frac{2\,d(j,i)}{w}\right), \qquad d(j,i) = \min\left(|j - i|,\ n - |j - i|\right) $$

where $j \in \{0, 1, \ldots, n-1\}$, $i$ is the target phase bin, $n = 360$, and $w$ is the phase width.

e.g. for phase=1, the initial one-hot is

| index | 0 | 1 | 2 | 3 | 4 | 5 | … | 358 | 359 |
|-------|---|---|---|---|---|---|---|-----|-----|
| value | 0 | 1 | 0 | 0 | 0 | 0 | … | 0   | 0   |

once blurred with a phase_width of 7, it becomes

| index | 0    | 1 | 2    | 3   | 4    | 5 | … | 358  | 359 |
|-------|------|---|------|-----|------|---|---|------|-----|
| value | 0.75 | 1 | 0.75 | 0.5 | 0.25 | 0 | … | 0.25 | 0.5 |

this reinforces the circular nature of the target space,
that 0 & 360 are not opposite but adjacent.
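
a direct numpy sketch of the blurred target, following the formula above (the exact ramp used in training may differ slightly):

```python
import numpy as np

def blurred_one_hot(phase, n=360, phase_width=7):
    """wrapped, blurred one-hot target of length n for an integer phase bin."""
    j = np.arange(n)
    d = np.abs(j - phase)
    d = np.minimum(d, n - d)                  # circular distance, so bins 0 and 359 are adjacent
    return np.maximum(0.0, 1.0 - 2.0 * d / phase_width)
```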

One-Hot Plots

training

the training for this model has taken a meandering path, and I’ve tried all sorts of tweaks and variations.

the basic setup for each run is:

evaluation (in progress)

once the overall structure was established and I knew that it worked, I started running tests to determine the optimal parameters.

phase width

first, I ran 5 independent runs with various phase widths.

parameters:

| target width | 3     | 5     | 7     | 9     | 11    |
|--------------|-------|-------|-------|-------|-------|
| best epoch   | 17    | 11    | 13    | 17    | 10    |
| f-measure    | 0.83  | 0.827 | 0.839 | 0.833 | 0.832 |
| cmlt         | 0.8   | 0.796 | 0.807 | 0.804 | 0.798 |
| amlt         | 0.872 | 0.862 | 0.874 | 0.871 | 0.86  |

Pulse Width Chart

as shown, the phase width had a clear effect on the magnitude of the loss (wider target = smaller loss), but it didn’t seem to have much of an effect on the model’s predictive accuracy.

attention

next, I tried removing attention.

parameters:

|            | attention | no attention |
|------------|-----------|--------------|
| best epoch | 39        | 32           |
| f-measure  | 0.848     | 0.855        |
| cmlt       | 0.82      | 0.819        |
| amlt       | 0.878     | 0.879        |

Attention Test Chart

these results were surprising, as I thought attention would have a larger effect on the accuracy.

we can see that while the attention helped slightly with validation loss in the latter half of the test, it had little effect on the accuracy of the model (actually performing slightly worse overall).

my current theory on this is that the problem just isn’t hard enough to benefit from the added complexity, i.e. the audio data is not nuanced enough to contain ‘key words’, so the attention module learns to leave the tcn output mostly unchanged (don’t feel like testing this)

on one hand, I was a little disappointed because it seemed like a good idea, and initial tests seemed promising.

on the other, the attention module added quite a bit of heft, and in its absence, the model uses far less memory.

(training w/ attention: ~20gb vram, 15 it/s; training w/o attention: ~1gb vram, 25 it/s)

learning rate

on both previous tests, I noticed that the accuracy metrics tended to jump around a lot.

my intuition was that the learning rate (1e-3) was too high, forcing it to take big steps that over/undershot the target.

so, I set up a few runs to test various rates (1e-5 through 1e-3)

note: the original paper used 1e-3

parameters:

| lr         | 1e-5  | 5e-5  | 1e-4  | 5e-4  | 1e-3* |
|------------|-------|-------|-------|-------|-------|
| best epoch | 52    | 37    | 51    | 40    | 10    |
| f-measure  | 0.807 | 0.853 | 0.849 | 0.854 | 0.837 |
| cmlt       | 0.772 | 0.821 | 0.819 | 0.824 | 0.802 |
| amlt       | 0.853 | 0.874 | 0.880 | 0.875 | 0.850 |

* in progress

LR Test Chart

evidently, the learning rate does not matter as much as I thought.

it looks like most runs converged pretty quickly on the same general trend (except for 1e-5 which seemingly hit a wall at about f=~0.8)

interestingly, 1e-4 and 5e-4 appear to be tracing the exact same peaks and valleys in the loss chart from epoch 45-60

5e-4 looks like a good balance between stable and explorative, so I’ll be using that going forward.

postproc

should’ve done this one way sooner, but I tested putting the ‘phase targets’ through the hmm/cleaner and it scored 0.908.

this means that even if the model outputs the targets perfectly, its ‘accuracy ceiling’ is 0.908 (unless I fix the postproc)

so let’s try to fix the postproc.

first of all, there are a lot of ‘magic number’ parameters, especially in the cleaner functions. so we’ll do a grid search over various options for the parameters:

now, with 4 options per parameter, this grid has 262,144 combos, which is too many to check exhaustively.

so instead, I generated each combination of parameters, put them in a list, and shuffled it.

now, I can watch the results as they come in, and get a decent estimate of the ‘shape’ of the data, honing in on the param options as needed.

by sampling the random distribution across all variables, then plotting [var] vs f-measure for each parameter, I can approximate the effect of each variable.
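
a sketch of that shuffled grid search; the parameter names and value grids below are placeholders, and `evaluate_postproc` is a stand-in for whatever runs the hmm + cleaner and scores the result:

```python
import itertools
import random

# placeholder grid: the real parameter names/values come from the cleaner functions
param_grid = {
    "onset_threshold": [280, 290, 300, 310],
    "early_beat_frac": [0.2, 0.3, 0.4, 0.5],
    "late_beat_frac": [1.5, 1.6, 1.7, 1.8],
    # ... one entry per tunable parameter
}

def evaluate_postproc(**params):
    """placeholder: run the hmm + cleaner with `params` and return the f-measure."""
    return 0.0  # stand-in value

combos = [dict(zip(param_grid, values))
          for values in itertools.product(*param_grid.values())]
random.shuffle(combos)      # random order ~= uniform sampling of the grid

results = []
for params in combos:       # stop early once the trends are clear
    results.append({**params, "f-measure": evaluate_postproc(**params)})
```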

Postproc grid plots

thus, it becomes clear which variables have an effect on the overall accuracy (early/late beat threshold), and which do not (bpm confidence)

| params    | original | optimized |
|-----------|----------|-----------|
| f-measure | 0.854    | 0.881     |
| cmlt      | 0.822    | 0.847     |
| amlt      | 0.881    | 0.891     |

[currently in progress]