I am going through these two librosa docs: melspectrogram and stft.
I am working on datasets of audio of variable lengths, but I don't quite get the shapes. For example:
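import librosa
import torch

(waveform, sample_rate) = librosa.load('audio_file')
spectrogram = librosa.feature.melspectrogram(y=waveform, sr=sample_rate)
# Pass y and sr by keyword; recent librosa versions reject positional arguments here
dur = librosa.get_duration(y=waveform, sr=sample_rate)
spectrogram = torch.from_numpy(spectrogram)
print(spectrogram.shape)  # (n_mels, n_frames)
print(sample_rate)
print(dur)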
Output:
torch.Size([128, 150])
22050
3.48
What I get are the following points:
I am trying to understand or calculate:
What is n_fft? I mean what exactly is it doing to the audio wave? I read in the documentation the following:
n_fft : int > 0 [scalar]
length of the windowed signal after padding with zeros. The number of rows in the STFT matrix D is (1 + n_fft/2). The default value, n_fft=2048 samples, corresponds to a physical duration of 93 milliseconds at a sample rate of 22050 Hz, i.e. the default sample rate in librosa.
This means that each window takes 2048 samples, i.e. 1/22050 * 2048 = 93 ms. Is the FFT being calculated for every 93 ms of the audio?
So does this mean that the window size and the window are for filtering the signal in this frame?
In the example above, I understand I am getting 128 Mel spectrograms, but what exactly does that mean?
And what is hop_length? Reading the docs, I understand that it is how far the window shifts from one FFT window to the next, right? If this value is 512 and n_fft is also 512, what does that mean? Does this mean that it will take a window of 23 ms, calculate the FFT for this window, and skip the next 23 ms?
How can I specify that I want to overlap from one FFT window to another?
Please help, I have watched many videos of calculating spectrograms but I just can't seem to see it in real life.
The essential parameter for understanding the output dimensions of spectrograms is not necessarily the length of the FFT used (n_fft), but the distance between consecutive FFTs, i.e., the hop_length.
When computing an STFT, you compute the FFT for a number of short segments. These segments have the length n_fft. Usually these segments overlap (in order to avoid information loss), so the distance between two segments is often not n_fft, but something like n_fft/2. The name for this distance is hop_length. It is also defined in samples.
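To make the segmenting concrete, here is a minimal NumPy sketch of what an STFT does with n_fft and hop_length. It is an illustration only, not librosa's implementation: the helper name frame_signal is made up, no padding is applied, and np.hanning is a stand-in for whatever window the library actually uses.

```python
import numpy as np

def frame_signal(y, n_fft=2048, hop_length=512):
    """Cut y into overlapping segments of n_fft samples, hop_length samples apart (no padding)."""
    n_frames = 1 + (len(y) - n_fft) // hop_length
    starts = np.arange(n_frames) * hop_length
    return np.stack([y[s:s + n_fft] for s in starts])

# One second of noise at 22050 Hz stands in for a real recording.
y = np.random.default_rng(0).standard_normal(22050)

frames = frame_signal(y)                      # one row per segment
window = np.hanning(frames.shape[1])          # taper each segment before the FFT
spectrum = np.abs(np.fft.rfft(frames * window, axis=1))

print(frames.shape)    # (number of segments, n_fft)
print(spectrum.shape)  # (number of segments, 1 + n_fft // 2)
```

Each segment starts hop_length samples after the previous one, so with hop_length smaller than n_fft consecutive segments share n_fft - hop_length samples. Note that this sketch puts time on the first axis, whereas librosa puts it on the last.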
So when you have 1000 audio samples and the hop_length is 100, you get about 10 feature frames (note that, if n_fft is greater than hop_length, you may need to pad).
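The effect of padding on the frame count can be sketched as follows. The function name count_frames is hypothetical, and the center=True branch is my understanding of how librosa's default centering behaves (the signal is padded by n_fft // 2 on each side), so treat it as an assumption rather than a specification.

```python
def count_frames(n_samples, hop_length, n_fft, center=True):
    """Number of STFT frames for a signal of n_samples samples."""
    if center:
        # With n_fft // 2 samples of padding on both sides, the count
        # depends only on hop_length.
        return 1 + n_samples // hop_length
    # Without padding, only windows that fit entirely inside the signal count.
    return 1 + (n_samples - n_fft) // hop_length

print(count_frames(1000, hop_length=100, n_fft=100, center=False))  # 10
print(count_frames(1000, hop_length=100, n_fft=200, center=False))  # 9
print(count_frames(1000, hop_length=100, n_fft=200, center=True))   # 11
```

So "10 frames" is exact only when the windows tile the signal without overlap or padding; otherwise the count is off by one or so in either direction.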
In your example, you are using the default hop_length of 512. So for audio sampled at 22050 Hz, you get a feature frame rate of
frame_rate = sample_rate/hop_length = 22050 Hz/512 ≈ 43 Hz
Again, padding may change this a little.
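It may help to convert these sample counts into time units. The short sketch below uses the default values mentioned above (n_fft=2048, hop_length=512, 22050 Hz); the variable names are only for illustration.

```python
sample_rate = 22050
n_fft = 2048
hop_length = 512

window_ms = 1000 * n_fft / sample_rate    # duration covered by one FFT window
hop_ms = 1000 * hop_length / sample_rate  # time between the starts of two windows
frame_rate = sample_rate / hop_length     # feature frames per second
overlap = 1 - hop_length / n_fft          # fraction shared by consecutive windows

print(f"window: {window_ms:.1f} ms, hop: {hop_ms:.1f} ms")
print(f"frame rate: {frame_rate:.2f} Hz, overlap: {overlap:.0%}")
```

Overlap is therefore controlled by choosing hop_length smaller than n_fft. If hop_length equals n_fft, the overlap is zero: the windows sit back to back and nothing is skipped. Audio would only be skipped if hop_length were larger than n_fft.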
So for 10 s of audio at 22050 Hz, you get a spectrogram array with dimensions of roughly (128, 430), where 128 is the number of Mel bins and 430 the number of feature frames (in this case, Mel spectra). The same arithmetic explains your output: 3.48 s × 43 Hz ≈ 150 frames.
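As a rough check, the expected shape can be estimated from the duration alone. This is a sketch under assumptions: expected_mel_shape is a made-up helper, it assumes librosa's defaults (n_mels=128, hop_length=512, centered frames), and the duration is rounded to a whole number of samples.

```python
def expected_mel_shape(duration_s, sample_rate=22050, hop_length=512, n_mels=128):
    """Estimate the (n_mels, n_frames) shape of a Mel spectrogram."""
    n_samples = round(duration_s * sample_rate)
    # Assumes centered (padded) frames, which adds one frame to the count.
    n_frames = 1 + n_samples // hop_length
    return (n_mels, n_frames)

print(expected_mel_shape(10.0))  # (128, 431)
print(expected_mel_shape(3.48))  # (128, 150)
```

The second result matches the torch.Size([128, 150]) from the question, and the first shows the one extra frame that padding adds to the 430 estimated above.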