
MFCC Python: completely different result from librosa vs python_speech_features vs tensorflow.signal


I'm trying to extract MFCC features from audio (a .wav file), and I have tried python_speech_features and librosa, but they give completely different results:

import librosa
from python_speech_features import mfcc

audio, sr = librosa.load(file, sr=None)

# librosa
hop_length = int(sr / 100)  # 10 ms hop
n_fft = int(sr / 40)        # 25 ms window
features_librosa = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=hop_length, n_fft=n_fft)

# python_speech_features
features_psf = mfcc(audio, sr, numcep=13, winlen=0.025, winstep=0.01)

Below are the plots:

librosa: [MFCC plot]

python_speech_features: [MFCC plot]

Did I pass any parameters incorrectly to those two methods? Why is there such a huge difference?

Update: I have also tried the tensorflow.signal implementation, and here's the result:

[tensorflow.signal MFCC plot]

The plot itself matches the librosa one more closely, but the scale is closer to python_speech_features. (Note that here I calculated 80 mel bins and took the first 13; if I do the calculation with only 13 bins, the result looks quite different as well.) Code below:

import numpy as np
import tensorflow as tf

stfts = tf.signal.stft(audio, frame_length=n_fft, frame_step=hop_length, fft_length=512)
spectrograms = tf.abs(stfts)

num_spectrogram_bins = stfts.shape[-1]
lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, 7600.0, 80
linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins, num_spectrogram_bins, sr, lower_edge_hertz, upper_edge_hertz)
mel_spectrograms = tf.tensordot(spectrograms, linear_to_mel_weight_matrix, 1)
mel_spectrograms.set_shape(spectrograms.shape[:-1].concatenate(linear_to_mel_weight_matrix.shape[-1:]))

log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)
features_tf = tf.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrograms)[..., :13]
features_tf = np.array(features_tf).T
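Per the TensorFlow documentation, `tf.signal.mfccs_from_log_mel_spectrograms` computes an orthonormal DCT-II of the log-mel spectrogram, which is why truncating 80 coefficients to 13 is not the same as computing with only 13 mel bins: the DCT basis length depends on the number of mel bins. A quick sketch with SciPy's `dct` on synthetic log-mel frames (the random data is purely illustrative, not derived from the audio above):

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
log_mel_80 = rng.standard_normal((5, 80))   # synthetic log-mel frames, 80 bins

# First 13 coefficients of an 80-point DCT-II (what the code above does)...
mfcc_from_80 = dct(log_mel_80, type=2, axis=-1, norm='ortho')[:, :13]

# ...is NOT the same as a 13-point DCT over 13 bins. (Using the first 13 of the
# 80 bins as a stand-in for a 13-bin mel spectrogram; a real 13-bin filterbank
# would differ further still, since the filters themselves would be wider.)
log_mel_13 = log_mel_80[:, :13]
mfcc_from_13 = dct(log_mel_13, type=2, axis=-1, norm='ortho')

print(np.allclose(mfcc_from_80, mfcc_from_13))  # False: different DCT lengths
```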

I think my question is: which output is closer to what MFCC actually looks like?


Solution

  • There are at least two factors at play that explain why you get different results:

    1. There is no single definition of the mel scale. Librosa implements two: Slaney and HTK. Other packages may (and do) use different definitions, leading to different results. That said, the overall picture should still be similar, which leads us to the second issue...
    2. python_speech_features by default puts the energy as the first (index-zero) coefficient (appendEnergy is True by default), meaning that when you ask for e.g. 13 MFCCs, you effectively get 12 MFCCs plus the energy.

    In other words, you were not comparing 13 librosa coefficients against 13 python_speech_features coefficients, but rather 13 against 12. The energy can be of a different order of magnitude and therefore produce a quite different picture due to the different colour scale.
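To make the first point concrete, here is a sketch of the two mel-scale definitions librosa supports, written in plain NumPy (formulas following librosa's `hz_to_mel`; no librosa required to run this):

```python
import numpy as np

def hz_to_mel_htk(f):
    """HTK mel scale: purely logarithmic."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def hz_to_mel_slaney(f):
    """Slaney mel scale: linear below 1 kHz, logarithmic above."""
    f = np.asarray(f, dtype=float)
    f_sp = 200.0 / 3.0               # ~66.67 Hz per mel in the linear region
    min_log_hz = 1000.0              # logarithmic region starts at 1 kHz
    min_log_mel = min_log_hz / f_sp  # = 15 mels at 1 kHz
    logstep = np.log(6.4) / 27.0
    return np.where(f < min_log_hz,
                    f / f_sp,
                    min_log_mel + np.log(np.maximum(f, min_log_hz) / min_log_hz) / logstep)

for f in (440.0, 1000.0, 8000.0):
    print(f, float(hz_to_mel_htk(f)), float(hz_to_mel_slaney(f)))
```

The two scales diverge substantially, especially above 1 kHz, so mel filterbanks built on them (and hence the resulting MFCCs) differ too.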

    I will now demonstrate how both modules can produce similar results:

    import librosa
    import python_speech_features
    import matplotlib.pyplot as plt
    from scipy.signal.windows import hann
    import seaborn as sns
    
    n_mfcc = 13
    n_mels = 40
    n_fft = 512 
    hop_length = 160
    fmin = 0
    fmax = None
    sr = 16000
    y, sr = librosa.load(librosa.example('nutcracker'), sr=sr, duration=5, offset=30)  # util.example_audio_file() was removed in librosa 0.10
    
    mfcc_librosa = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                        n_mfcc=n_mfcc, n_mels=n_mels,
                                        hop_length=hop_length,
                                        fmin=fmin, fmax=fmax, htk=False)
    
    mfcc_speech = python_speech_features.mfcc(signal=y, samplerate=sr, winlen=n_fft / sr, winstep=hop_length / sr,
                                              numcep=n_mfcc, nfilt=n_mels, nfft=n_fft, lowfreq=fmin, highfreq=fmax,
                                              preemph=0.0, ceplifter=0, appendEnergy=False, winfunc=hann)
    

    [side-by-side plots of the librosa and python_speech_features MFCCs]

    As you can see, the scale is different, but the overall picture looks really similar. Note that I had to make sure that the parameters passed to the two modules are the same.
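If you want a number rather than eyeballing plots, one option (my suggestion, not part of the original answer) is to z-score each coefficient over time and compute a per-coefficient Pearson correlation between the two outputs; values near 1 mean the shapes agree even when the scales don't. A sketch on synthetic matrices standing in for the real MFCC arrays:

```python
import numpy as np

def coeff_correlations(a, b):
    """Pearson correlation per coefficient between two (n_mfcc, n_frames) arrays."""
    a = (a - a.mean(axis=1, keepdims=True)) / a.std(axis=1, keepdims=True)
    b = (b - b.mean(axis=1, keepdims=True)) / b.std(axis=1, keepdims=True)
    return (a * b).mean(axis=1)

# Synthetic stand-ins: same content, differing only by an affine scale/offset,
# mimicking e.g. a log-base or lifter difference between libraries.
rng = np.random.default_rng(1)
base = rng.standard_normal((13, 500))
rescaled = 3.7 * base - 12.0

print(coeff_correlations(base, rescaled))  # all ~1.0: identical up to scaling
```

In practice you would pass `mfcc_librosa` and `mfcc_speech.T` (python_speech_features returns frames x coefficients, librosa returns coefficients x frames).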