I'm using the librosa
library to get and filter spectrograms from audio data.
I mostly understand the math behind generating a spectrogram:
So that's really easy with librosa
:
spec = np.abs(librosa.stft(signal, n_fft=len(window), window=window)
Yay! I've got my matrix of FFTs. Now I see this function librosa.amplitude_to_db
and I think this is where my ignorance of signal processing starts to show. Here is a snippet I found on Medium:
spec = np.abs(librosa.stft(y, hop_length=512))
spec = librosa.amplitude_to_db(spec, ref=np.max)
Why does the author use this amplitude_to_db
function? Why not just plot the output of the STFT directly?
The range of perceivable sound pressure is very wide, from around 20 μPa (micro Pascal) to 20 Pa, a ratio of 1 million. Furthermore the human perception of sound levels is not linear, but better approximated by a logarithm.
By converting to decibels (dB) the scale becomes logarithmic. This limits the numerical range, to something like 0-120 dB instead. The intensity of colors when this is plotted corresponds more closely to what we hear than if one used a linear scale.
Note that the reference (0 dB) point in decibels can be chosen freely. The default for librosa.amplitude_to_db
is to compute numpy.max
, meaning that the max value of the input will be mapped to 0 dB. All other values will then be negative. The function also applies a threshold on the range of sounds, by default 80 dB. So anything lower than -80 dB will be clipped -80 dB.