
How to analyze MP3 for beat/drums timestamps, trigger actions and playback at the same time (Rust)


I want to trigger an action (for example, let a bright light flash) whenever the beat or drums of an MP3 file are audible during playback. I don't know the theoretical procedure/approach I should take.

First I thought about statically analyzing the MP3 as a first step. The result of the analysis would be the timestamps at which the action should be triggered. Then I start the MP3, and another thread triggers the actions at those timings. Playback should be easy because I can use the rodio crate; the static analysis part is the hard bit.

Analysis algorithm:

My idea was to read the raw audio data from an MP3 using the minimp3 crate and do an FFT with the rustfft crate. Once I have the spectrum from the FFT, I could look for low frequencies at a high volume, which should be the beat of the song.

I tried combining minimp3 and rustfft, but I have absolutely no clue what the data I get back really means, and I can't really write a test for it either.

This is my approach so far:

use minimp3::{Decoder, Frame, Error};

use std::fs::File;
use std::sync::Arc;
use rustfft::FFTplanner;
use rustfft::num_complex::Complex;
use rustfft::num_traits::{Zero, FromPrimitive, ToPrimitive};

fn main() {
    let mut decoder = Decoder::new(File::open("08-In the end.mp3").unwrap());

    loop {
        match decoder.next_frame() {
            Ok(Frame { data, sample_rate, channels, .. }) => {
                // we only need mono data; the decoded data is interleaved:
                // data[0] is the first sample of the left channel,
                // data[1] the first sample of the right channel, ...
                let mut mono_audio = vec![];
                for i in 0..data.len() / channels {
                    // average the two channels (assumes channels == 2)
                    let sum = data[2 * i] as i32 + data[2 * i + 1] as i32;
                    let avg = (sum / 2) as i16;
                    mono_audio.push(avg);
                }
                // unnormalized spectrum; now check where the beat/drums are 
                // by checking for high volume in low frequencies
                let spectrum = calc_fft(&mono_audio);
            },
            Err(Error::Eof) => break,
            Err(e) => panic!("{:?}", e),
        }
    }
}

fn calc_fft(raw_mono_audio_data: &[i16]) -> Vec<f32> {
    let len = raw_mono_audio_data.len();

    // from &[i16] to Vec<Complex<f32>>; the imaginary part is zero
    let mut input: Vec<Complex<f32>> = raw_mono_audio_data
        .iter()
        .map(|&val| Complex::new(val as f32, 0.0))
        .collect();
    let mut spectrum: Vec<Complex<f32>> = vec![Complex::zero(); len];

    let mut planner = FFTplanner::new(false);
    let fft = planner.plan_fft(len);
    fft.process(&mut input, &mut spectrum);

    // the FFT output is complex; the magnitude (norm) of each bin
    // tells how strongly that frequency is present (still unnormalized)
    spectrum.iter().map(|val| val.norm()).collect()
}

My problem is also that the FFT function doesn't have any parameter where I can specify the sample rate (which is 48,000 Hz). All I get from decoder.next_frame() is a Vec<i16> with 2304 items.

Any ideas how I can achieve that, and what do the numbers I currently get actually mean?


Solution

  • TL;DR:

    Decouple audio-data preparation from the analysis. (1) Read the MP3/WAV data, join the two channels to mono (easier analysis), and take slices of the data whose length is a power of 2 (required for the FFT; pad with zeroes if necessary). (2) Feed that data to the crate spectrum_analyzer and learn from its code (which is excellently documented) how the presence of certain frequencies can be obtained from the FFT.
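The zero-padding mentioned above can be sketched in a few lines (`pad_to_pow2` is a hypothetical helper name, not part of any crate; Rust's standard library provides `next_power_of_two` on the integer types):

```rust
/// Pads a sample window with trailing zeroes so its length becomes the
/// next power of two, which many FFT routines require or prefer.
fn pad_to_pow2(window: &[i16]) -> Vec<i16> {
    let target_len = window.len().next_power_of_two();
    let mut padded = window.to_vec();
    padded.resize(target_len, 0); // fill the tail with silence
    padded
}

fn main() {
    let window = vec![1_i16; 2000];
    let padded = pad_to_pow2(&window);
    assert_eq!(padded.len(), 2048);
    assert_eq!(padded[1999], 1); // original samples untouched
    assert_eq!(padded[2000], 0); // tail is zero
}
```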

    Longer version

    Decouple the problem into smaller problems/subtasks.

    1. analysis of audio data in discrete windows => beat: yes or no

      • a "window" is usually a fixed-size view into the ongoing stream of audio data
      • choose a strategy here: for example a lowpass filter, an FFT, a combination of both, ... search the literature for "beat detection algorithm"
        • if you do an FFT, you should always extend your data window to the next power of 2 (e.g. by zero-padding).
    2. read the MP3, convert it to mono, and pass the audio samples step by step to the analysis algorithm.

      • You can use the sampling rate and the sample index to calculate the point in time
      • => attach "beat: yes/no" to timestamps inside the song
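The sample-index-to-timestamp calculation from step 2 is simple arithmetic; a minimal sketch (the function name `timestamp_secs` is made up for illustration):

```rust
/// Converts a sample index (in the mono stream) to a timestamp
/// in seconds. `sample_rate` is in Hz, e.g. 44_100 or 48_000.
fn timestamp_secs(sample_index: usize, sample_rate: u32) -> f64 {
    sample_index as f64 / sample_rate as f64
}

fn main() {
    // at 44.1 kHz, sample 44_100 lies exactly 1 second into the song
    assert_eq!(timestamp_secs(44_100, 44_100), 1.0);
    assert_eq!(timestamp_secs(22_050, 44_100), 0.5);
}
```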

    The analysis part should be kept generally usable, so that it works for live audio as well as for files. Music is usually discretized at 44,100 Hz or 48,000 Hz with 16-bit resolution. All common audio libraries will give you an interface for accessing microphone input with these properties, and if you read an MP3 or a WAV instead, the audio data usually comes in the same format. If you analyze windows of length 2048 at 44,100 Hz, for example, each window covers 1/f * n == T * n == n/f == (2048/44100) s ≈ 46.4 ms. The shorter the time window, the faster your beat detection can react, but the lower its accuracy (frequency resolution) will be - it's a tradeoff :) Your algorithm could keep knowledge about previous windows and overlap them to reduce noise/wrong data.
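The window-duration arithmetic above, plus the mapping that answers the sample-rate question from the original post, can be sketched as follows. The FFT itself is sample-rate-agnostic: the sample rate only enters when you interpret the output, since bin `i` of an `n`-point FFT corresponds to `i * sample_rate / n` Hz (function names here are made up for illustration):

```rust
/// Duration of one analysis window in milliseconds: n / f * 1000.
fn window_duration_ms(window_len: usize, sample_rate: u32) -> f64 {
    window_len as f64 / sample_rate as f64 * 1000.0
}

/// Center frequency in Hz of FFT bin `i` for an `n`-point FFT.
/// This is the only place the sample rate matters; the FFT never sees it.
fn bin_frequency_hz(i: usize, n: usize, sample_rate: u32) -> f64 {
    i as f64 * sample_rate as f64 / n as f64
}

fn main() {
    // 2048 samples at 44.1 kHz ≈ 46.4 ms, as computed above
    let d = window_duration_ms(2048, 44_100);
    assert!((d - 46.44).abs() < 0.01);

    // bin 3 of a 2048-point FFT at 44.1 kHz ≈ 64.6 Hz,
    // i.e. in the bass range relevant for beat detection
    let f = bin_frequency_hz(3, 2048, 44_100);
    assert!((f - 64.6).abs() < 0.1);
}
```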

    To view existing code that solves these sub-problems, I suggest the following crates:

    The crate beat-detector pretty much implements what this question asks for: it connects live audio input with the analysis algorithm.