I'm trying to extract MFCC features from audio (a .wav file). I have tried python_speech_features and librosa, but they give completely different results:
import librosa
from python_speech_features import mfcc

audio, sr = librosa.load(file, sr=None)

# librosa: match psf's 25 ms window and 10 ms hop
hop_length = int(sr / 100)   # 10 ms
n_fft = int(sr / 40)         # 25 ms
features_librosa = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=hop_length, n_fft=n_fft)

# python_speech_features
features_psf = mfcc(audio, sr, numcep=13, winlen=0.025, winstep=0.01)
Below are the plots:
Did I pass any parameters incorrectly to those two methods? Why is there such a huge difference?
Update: I have also tried the tensorflow.signal implementation; here's the result:
The plot itself matches the librosa one more closely, but the scale is closer to python_speech_features. (Note that here I calculated 80 mel bins and took the first 13; if I do the calculation with only 13 bins, the result looks quite different as well.) Code below:
import numpy as np
import tensorflow as tf

# STFT -> magnitude spectrogram
stfts = tf.signal.stft(audio, frame_length=n_fft, frame_step=hop_length, fft_length=512)
spectrograms = tf.abs(stfts)

# warp the linear-frequency bins onto 80 mel bins
num_spectrogram_bins = stfts.shape[-1]
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 7600.0, 80
linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins, num_spectrogram_bins, sr, lower_edge_hertz, upper_edge_hertz)
mel_spectrograms = tf.tensordot(spectrograms, linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(
    linear_to_mel_weight_matrix.shape[-1:]))

# log-mel, then DCT; keep the first 13 coefficients
log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)
features_tf = tf.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrograms)[..., :13]
features_tf = np.array(features_tf).T
I think my question is: which output is closest to what MFCCs actually look like?
There are at least two factors at play here that explain why you get different results:
First, there is no single definition of the mel filterbank: librosa implements two variants, Slaney and HTK (selected via the htk flag). Other packages might and will use different definitions, leading to different results. That being said, the overall picture should be similar. That leads us to the second issue: python_speech_features by default puts energy as the first (index zero) coefficient (appendEnergy is True by default), meaning that when you ask for e.g. 13 MFCCs, you effectively get 12 + 1. In other words, you were not comparing 13 librosa coefficients against 13 python_speech_features coefficients, but rather 13 vs 12. The energy can be of a different magnitude and therefore produces quite a different picture due to the different colour scale.
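To see the first point concretely, here is a small numpy sketch of the two common hertz-to-mel conventions (my own illustration, not taken from either library's source; the hz_to_mel_* function names are made up for this example):

```python
import numpy as np

def hz_to_mel_htk(f):
    # HTK convention: mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def hz_to_mel_slaney(f):
    # Slaney convention: linear below 1 kHz, logarithmic above
    f = np.asarray(f, dtype=float)
    linear = f / (200.0 / 3.0)
    log_region = 15.0 + np.log(np.maximum(f, 1e-3) / 1000.0) / (np.log(6.4) / 27.0)
    return np.where(f >= 1000.0, log_region, linear)

freqs = np.linspace(0, 8000, 200)
htk = hz_to_mel_htk(freqs)
slaney = hz_to_mel_slaney(freqs)

# Even after normalizing both curves to [0, 1], their shapes disagree,
# so the triangular filters end up centered at different frequencies.
diff = np.abs(htk / htk[-1] - slaney / slaney[-1])
print(diff.max())
```

Both curves are monotone warpings of frequency, but they are not proportional to each other, which is enough to shift every filter center and change the resulting coefficients.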
I will now demonstrate how both modules can produce similar results:
import librosa
import python_speech_features
import matplotlib.pyplot as plt
from scipy.signal.windows import hann
import seaborn as sns
n_mfcc = 13
n_mels = 40
n_fft = 512
hop_length = 160
fmin = 0
fmax = None
sr = 16000
y, sr = librosa.load(librosa.util.example_audio_file(), sr=sr, duration=5, offset=30)

mfcc_librosa = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                    n_mfcc=n_mfcc, n_mels=n_mels,
                                    hop_length=hop_length,
                                    fmin=fmin, fmax=fmax, htk=False)

# note: python_speech_features returns (frames, coefficients),
# the transpose of librosa's (coefficients, frames) layout
mfcc_speech = python_speech_features.mfcc(signal=y, samplerate=sr, winlen=n_fft / sr,
                                          winstep=hop_length / sr,
                                          numcep=n_mfcc, nfilt=n_mels, nfft=n_fft,
                                          lowfreq=fmin, highfreq=fmax,
                                          preemph=0.0, ceplifter=0, appendEnergy=False,
                                          winfunc=hann)
As you can see, the scale is different, but the overall picture looks really similar. Note that I had to make sure that the parameters passed to the two modules match.
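The remaining scale gap is also explainable: as far as I know, librosa converts the mel spectrogram to decibels (10 * log10) before the DCT, while python_speech_features takes the natural log. Since the DCT is linear, those two conventions differ by a constant factor of 10 / ln(10) ≈ 4.34 (up to liftering and floor-clamping differences). A small numpy/scipy sketch with made-up mel energies confirms the idea:

```python
import numpy as np
from scipy.fftpack import dct

mel_energies = np.random.rand(40) + 0.1     # fake mel filterbank outputs

log_mel = np.log(mel_energies)              # natural log (python_speech_features style)
db_mel = 10.0 * np.log10(mel_energies)      # decibels (librosa style)

c_log = dct(log_mel, type=2, norm='ortho')
c_db = dct(db_mel, type=2, norm='ortho')

# The DCT is linear, so the dB version is just a rescaled copy:
same_up_to_scale = np.allclose(c_db, c_log * 10.0 / np.log(10.0))
print(same_up_to_scale)
```

So a constant rescaling of the colour bar accounts for much of the visual difference, while the structure of the coefficients is the same.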