Tags: machine-learning, audio, neural-network, signal-processing, voice-recognition

My speaker recognition neural network doesn’t work well


I have a final project for my first degree, and I want to build a neural network that takes the first 13 MFCC coefficients of a wav file and returns who is speaking in the audio file, out of a group of speakers.

I want you to notice that:

  1. My audio files are text-independent, so they differ in length and in the words spoken
  2. I trained the model on about 35 audio files from 10 speakers (the first speaker had about 15, the second 10, and the third and fourth about 5 each)

I defined :

X=mfcc(sound_voice)

Y=zero_array + 1 in the i_th position ( where i_th position is 0 for the first speaker, 1 for the second, 2 for the third... )

And then I trained the model and checked its output for some files...

So that’s what I did... but unfortunately it looks like the results are completely random...

Can you help me understand why?

This is my code in python -

from sklearn.neural_network import MLPClassifier
import python_speech_features
import scipy.io.wavfile as wav
import numpy as np
from os import listdir
from os.path import isfile, join
from random import shuffle
import matplotlib.pyplot as plt
from tqdm import tqdm

winner = []  # this array count how much Bingo we had when we test the NN
for TestNum in tqdm(range(5)):  # in every round we build NN with X,Y that out of them we check 50 after we build the NN
    X = []
    Y = []
    onlyfiles = [f for f in listdir("FinalAudios/") if isfile(join("FinalAudios/", f))]   # Files in dir
    names = []  # names of the speakers
    for file in onlyfiles:  # for each wav sound
        # UNESSECERY TO UNDERSTAND THE CODE
        if " " not in file.split("_")[0]:
            names.append(file.split("_")[0])
        else:
            names.append(file.split("_")[0].split(" ")[0])
    names = list(dict.fromkeys(names))  # names of speakers
    vector_names = []  # vector for each name
    i = 0
    vector_for_each_name = [0] * len(names)
    for name in names:
        vector_for_each_name[i] += 1
        vector_names.append(np.array(vector_for_each_name))
        vector_for_each_name[i] -= 1
        i += 1
    for f in onlyfiles:
        if " " not in f.split("_")[0]:
            f_speaker = f.split("_")[0]
        else:
            f_speaker = f.split("_")[0].split(" ")[0]
        (rate, sig) = wav.read("FinalAudios/" + f)  # read the file
        try:
            mfcc_feat = python_speech_features.mfcc(sig, rate, winlen=0.2, nfft=512)  # mfcc coeffs
            for index in range(len(mfcc_feat)):  # adding each mfcc coeff to X, meaning if there is 50000 coeffs than
                # X will be [first coeff, second .... 50000'th coeff] and Y will be [f_speaker_vector] * 50000
                X.append(np.array(mfcc_feat[index]))
                Y.append(np.array(vector_names[names.index(f_speaker)]))
        except IndexError:
            pass
    Z = list(zip(X, Y))

    shuffle(Z)  # WE SHUFFLE X,Y TO PERFORM RANDOM ON THE TEST LEVEL

    X, Y = zip(*Z)
    X = list(X)
    Y = list(Y)
    X = np.asarray(X)
    Y = np.asarray(Y)

    Y_test = Y[:50]  # CHOOSE 50 FOR TEST, OTHERS FOR TRAIN
    X_test = X[:50]
    X = X[50:]
    Y = Y[50:]

    clf = MLPClassifier(solver='lbfgs', alpha=1e-2, hidden_layer_sizes=(5, 3), random_state=2)  # create the NN
    clf.fit(X, Y)  # Train it

    for sample in range(len(X_test)):  # add 1 to winner array if we correct and 0 if not, than in the end it plot it
        if list(clf.predict([X[sample]])[0]) == list(Y_test[sample]):
            winner.append(1)
        else:
            winner.append(0)

# plot winner
plot_x = []
plot_y = []
for i in range(1, len(winner)):
    plot_y.append(sum(winner[0:i])*1.0/len(winner[0:i]))
    plot_x.append(i)
plt.plot(plot_x, plot_y)
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')

# giving a title to my graph
plt.title('My first graph!')

# function to show the plot
plt.show()

This is my zip file that contains the code and the audio file : https://ufile.io/eggjm1gw


Solution

  • You have a number of issues in your code and it will be close to impossible to get it right in one go, but let's give it a try. There are two major issues:

    • You're currently trying to train the neural network on very few examples, as few as a single one per speaker (!). No machine learning algorithm can learn anything useful from so little data.
    • To make matters worse, you feed the ANN only the MFCCs for the first 25 ms of each recording (25 ms is the default winlen parameter of python_speech_features). The first 25 ms of every recording will be close to identical. Even with 10k recordings per speaker, this approach would not get you anywhere.

    I will give you concrete advice, but I won't do all the coding; it's your homework after all.

    • Use all MFCC vectors, not just the first 25 ms. Many of them should be skipped simply because they contain no voice activity. Normally a VAD (Voice Activity Detector) would tell you which ones to keep, but for this exercise I'd skip it for starters (you need to learn the basics first).
    • Don't use dictionaries. Not only will they not work with more than one MFCC vector per speaker, they are also a very inefficient data structure for this task. Use numpy arrays; they're much faster and more memory efficient. There are plenty of tutorials, including the scikit-learn ones, that demonstrate how to use numpy in this context. In essence, you create two arrays: one with training data, the other with labels. Example: if speaker omersk "produces" 50000 MFCC vectors, you get a (50000, 13) training array. The corresponding label array has 50000 entries, all holding the same constant value (id) for that speaker (say, omersk is 0, lucas is 1 and so on). I'd consider taking longer windows (perhaps 200 ms, experiment!) to reduce the variance.
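
    The two-array layout described above can be sketched roughly as follows. This is only an illustration: the random arrays stand in for real MFCC output so that it runs without audio files, and the recording list and frame counts are made up.

```python
import numpy as np

# Stand-in for real MFCC output: one (n_frames, 13) array per recording.
# Random numbers are used only so the sketch runs without audio files.
rng = np.random.default_rng(0)
recordings = [
    ("omersk", rng.normal(size=(120, 13))),
    ("lucas", rng.normal(size=(80, 13))),
    ("omersk", rng.normal(size=(50, 13))),
]

# Speaker id = position in this list (omersk is 0, lucas is 1).
speakers = ["omersk", "lucas"]

# One row per MFCC frame ...
X = np.concatenate([mfcc for _, mfcc in recordings], axis=0)
# ... and one integer label per row, constant within each recording.
y = np.concatenate(
    [np.full(len(mfcc), speakers.index(name)) for name, mfcc in recordings]
)

print(X.shape, y.shape)
```

    With real data, each random array would be replaced by the MFCC matrix of one wav file; the rest stays the same.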

    Don't forget to split your data into training, validation and test sets. You will have more than enough data. Also, for this exercise I'd take care not to feed in too much data for any single speaker, or otherwise take steps to make sure the algorithm is not biased.
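
    One possible way to do such a split is scikit-learn's GroupShuffleSplit, which keeps all frames of a recording on the same side of the split. The sketch below is not the only option; the synthetic data, the 60/20/20 proportions and the helper name split_by_recording are my own assumptions. Note that GroupShuffleSplit does not stratify by speaker, so check the class counts afterwards.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in: 10 recordings of 40 MFCC frames each (not real audio).
rng = np.random.default_rng(0)
n_recordings, frames_per_recording = 10, 40
X = rng.normal(size=(n_recordings * frames_per_recording, 13))
# groups[i] is the recording that frame i came from.
groups = np.repeat(np.arange(n_recordings), frames_per_recording)
# Hypothetical layout: recordings 0-4 are speaker 0, recordings 5-9 speaker 1.
y = (groups >= 5).astype(int)


def split_by_recording(X, y, groups, test_size, seed):
    """Split frame indices so that no recording ends up on both sides."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    return next(splitter.split(X, y, groups))


# First carve out 20% of the recordings for test ...
train_val_idx, test_idx = split_by_recording(X, y, groups, test_size=0.2, seed=0)
# ... then 25% of the remainder (20% of the total) for validation.
train_rel, val_rel = split_by_recording(
    X[train_val_idx], y[train_val_idx], groups[train_val_idx], test_size=0.25, seed=0
)
train_idx, val_idx = train_val_idx[train_rel], train_val_idx[val_rel]

print(len(train_idx), len(val_idx), len(test_idx))
# Frames per speaker in the training set, to spot an imbalance.
print(np.bincount(y[train_idx], minlength=2))
```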

    Later, when you make a prediction, you again compute MFCCs for the recording. With a 10 s recording, a 200 ms window and a 100 ms overlap, you'll get 99 MFCC vectors, shape (99, 13). The model should run on each of the 99 vectors and produce class probabilities for each. When you sum them (and normalise, to make it nice) and take the top value, you get the most likely speaker.
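
    A minimal sketch of that per-recording decision, assuming a classifier that exposes predict_proba. The training data is synthetic (three well-separated clusters rather than real speakers), and the helper name predict_speaker and the network size are only for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Frame count for a 10 s recording, 200 ms window, 100 ms step,
# computed in integer milliseconds to avoid float rounding.
duration_ms, win_ms, step_ms = 10_000, 200, 100
n_frames = 1 + (duration_ms - win_ms) // step_ms

# Synthetic stand-in for the MFCC frames of three speakers (not real audio).
rng = np.random.default_rng(0)
centers = np.array([-4.0, 0.0, 4.0])
X_train = np.concatenate([rng.normal(loc=c, size=(200, 13)) for c in centers])
y_train = np.repeat(np.arange(3), 200)

clf = MLPClassifier(hidden_layer_sizes=(16,), solver="lbfgs",
                    max_iter=500, random_state=0)
clf.fit(X_train, y_train)


def predict_speaker(model, mfcc_frames):
    """Sum per-frame class probabilities, normalise, and pick the top class."""
    frame_probs = model.predict_proba(mfcc_frames)  # (n_frames, n_speakers)
    summed = frame_probs.sum(axis=0)
    scores = summed / summed.sum()
    return model.classes_[np.argmax(scores)], scores


# A "recording" of 99 frames drawn from the third speaker's cluster.
recording = rng.normal(loc=centers[2], size=(n_frames, 13))
speaker, scores = predict_speaker(clf, recording)
print(recording.shape, speaker, scores.round(3))
```

    Summing probabilities lets confident frames outweigh uncertain ones; a simple majority vote over per-frame predictions would be an alternative.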

    There are a dozen other things that would typically be taken into account, but in this case (homework) I'd focus on getting the basics right.

    EDIT: I decided to take a stab at creating the model with your idea at heart, but with the basics fixed. It's not exactly clean Python, since it's adapted from a Jupyter Notebook I was running.

    import python_speech_features
    import scipy.io.wavfile as wav
    import numpy as np
    import glob
    import os
    
    from collections import defaultdict
    from sklearn.neural_network import MLPClassifier
    from sklearn import preprocessing
    from sklearn.model_selection import cross_validate
    from sklearn.ensemble import RandomForestClassifier
    
    
    audio_files_path = glob.glob('audio/*.wav')
    win_len = 0.04 # in seconds
    step = win_len / 2
    nfft = 2048
    
    mfccs_all_speakers = []
    names = []
    data = []
    
    for path in audio_files_path:
        fs, audio = wav.read(path)
        if audio.size > 0:
            mfcc = python_speech_features.mfcc(audio, samplerate=fs, winlen=win_len,
                                                winstep=step, nfft=nfft, appendEnergy=False)
            filename = os.path.splitext(os.path.basename(path))[0]
            speaker = filename[:filename.find('_')]
            data.append({'filename': filename,
                         'speaker': speaker,
                         'samples': mfcc.shape[0],
                         'mfcc': mfcc})
        else:
            print(f'Skipping {path} due to 0 file size')
    
    speaker_sample_size = defaultdict(int)
    for entry in data:
        speaker_sample_size[entry['speaker']] += entry['samples']
    
    person_with_fewest_samples = min(speaker_sample_size, key=speaker_sample_size.get)
    print(person_with_fewest_samples)
    
    max_accepted_samples = int(speaker_sample_size[person_with_fewest_samples] * 0.8)
    print(max_accepted_samples)
    
    training_idx = []
    test_idx = []
    accumulated_size = defaultdict(int)
    
    for entry in data:
        if entry['speaker'] not in accumulated_size:
            training_idx.append(entry['filename'])
            accumulated_size[entry['speaker']] += entry['samples']
        elif accumulated_size[entry['speaker']] < max_accepted_samples:
            accumulated_size[entry['speaker']] += entry['samples']
            training_idx.append(entry['filename'])
    
    X_train = []
    label_train = []
    
    X_test = []
    label_test = []
    
    for entry in data:
        if entry['filename'] in training_idx:
            X_train.append(entry['mfcc'])
            label_train.extend([entry['speaker']] * entry['mfcc'].shape[0])
        else:
            X_test.append(entry['mfcc'])
            label_test.extend([entry['speaker']] * entry['mfcc'].shape[0])
    
    X_train = np.concatenate(X_train, axis=0)
    X_test = np.concatenate(X_test, axis=0)
    
    assert (X_train.shape[0] == len(label_train))
    assert (X_test.shape[0] == len(label_test))
    
    print(f'Training: {X_train.shape}')
    print(f'Testing: {X_test.shape}')
    
    le = preprocessing.LabelEncoder()
    y_train = le.fit_transform(label_train)
    y_test = le.transform(label_test)
    
    clf = MLPClassifier(solver='lbfgs', alpha=1e-2, hidden_layer_sizes=(5, 3), random_state=42, max_iter=1000)
    
    cv_results = cross_validate(clf, X_train, y_train, cv=4)
    print(cv_results)
    
    # Output:
    # {'fit_time': array([3.33842635, 4.25872731, 4.73704267, 5.9454329 ]),
    #  'score_time': array([0.00125694, 0.00073504, 0.00074005, 0.00078583]),
    #  'test_score': array([0.40380048, 0.52969121, 0.48448687, 0.46043165])}
    

    The test_score isn't stellar. There's a lot to improve (for starters, the choice of algorithm), but the basics are there. Notice in particular how I select the training samples. It's not random; I only consider recordings as a whole. You can't put samples from a given recording into both training and test, as the test data is supposed to be novel.

    What was not working in your code? I'd say a lot. You were taking 200 ms frames and yet a very short FFT (nfft=512). python_speech_features likely warned you that the FFT should not be shorter than the frame you're processing.
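
    To check the FFT size against the frame length, something like the following would do. The helper min_nfft is hypothetical (not part of python_speech_features), and the sample rates are assumptions, since I'm only guessing at what your wav files use.

```python
def min_nfft(samplerate, winlen):
    """Smallest power of two that is at least the frame length in samples."""
    frame_len = int(round(winlen * samplerate))
    nfft = 1
    while nfft < frame_len:
        nfft *= 2
    return nfft


# Assumed sample rates; read the real one from wav.read().
for rate in (16000, 44100):
    for winlen in (0.025, 0.04, 0.2):
        print(f"{rate} Hz, winlen={winlen} s -> nfft >= {min_nfft(rate, winlen)}")
```

    For a 200 ms frame at 16 kHz that is 3200 samples, so nfft=512 would cut off most of each frame.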

    I leave testing the model to you. It won't be good, but it's a start.