I'm using the built-in beat_track function from librosa:
from librosa.beat import beat_track

# recent librosa versions require keyword arguments here
tempo, beat_frames = beat_track(y=audio, sr=sampling_rate)
The actual tempo of the song is 146 BPM, whereas the function estimates 73.5 BPM. I understand that 73.5 × 2 = 147 ≈ 146 BPM, i.e., the estimate is roughly half the true tempo. How can we get an estimate in the correct octave?
What you observe is the so-called "octave error", i.e., the estimate is wrong by a factor of 2, 1/2, 3, or 1/3. It's quite a common problem in global tempo estimation. A great, classic introduction to global tempo estimation can be found in An Experimental Comparison of Audio Tempo Induction Algorithms. The article also introduces the common metrics Acc1 and Acc2.
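Concretely, these metrics are commonly implemented along the following lines (a minimal sketch; the 4% tolerance and the set of allowed factors follow the conventions described in that article):

def acc1(estimate, truth, tol=0.04):
    # Acc1: estimate is within +/-4% of the annotated tempo
    return abs(estimate - truth) <= tol * truth

def acc2(estimate, truth, tol=0.04):
    # Acc2: octave errors (factors 2, 3, 1/2, 1/3) also count as correct
    return any(acc1(estimate, truth * f, tol) for f in (1, 2, 3, 1/2, 1/3))

print(acc1(73.5, 146.0))  # False: off by a factor of ~2
print(acc2(73.5, 146.0))  # True: 73.5 * 2 = 147, within 4% of 146

In other words, your 73.5 BPM estimate is correct under Acc2 but wrong under Acc1, which is exactly what the octave error describes.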
Since the publication of that article, many researchers have tried to solve the octave-error problem. The most promising approaches (from my very biased point of view) are A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network by myself (you might also want to check out this later paper, which uses a simpler NN architecture) and Multi-Task Learning of Tempo and Beat: Learning One to Improve the Other by Böck et al.
Both approaches use convolutional neural networks (CNNs) to analyze spectrograms. While a CNN could also be implemented in librosa, it currently lacks the programmatic infrastructure to do this easily. Another audio analysis framework is a step ahead in this regard: Essentia. It is capable of running TensorFlow models.
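For illustration, a sketch of running a CNN-based tempo estimator via Essentia's TempoCNN algorithm; it assumes you have installed Essentia with TensorFlow support, downloaded one of the published TempoCNN model graphs (e.g. deeptemp-k16.pb) from Essentia's model repository, and have audio resampled to the 11025 Hz these models expect:

from essentia.standard import MonoLoader, TempoCNN

# TempoCNN models operate on 11025 Hz mono audio
audio = MonoLoader(filename='song.wav', sampleRate=11025)()

# returns a global tempo estimate plus local estimates and their probabilities
global_tempo, local_tempi, local_probs = TempoCNN(graphFilename='deeptemp-k16.pb')(audio)
print(f'estimated global tempo: {global_tempo:.1f} BPM')

Because such models are trained to predict the tempo class directly, they are considerably less prone to octave errors than classic autocorrelation-based estimators like the one in librosa.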