python neural-network voice-recognition mfcc pitch

Features for speaker recognition that can be added to MFCC features / things I can do to improve my speaker recognition neural network

I'm trying to build a machine learning model for speaker recognition.

Currently I'm using the following scheme:

For each 0.15-second window of every audio file in my data set, I compute 13 mel-frequency cepstral coefficients (MFCCs).
Each 13-coefficient vector is fed to a neural network built from 3 blocks of [conv, pool, norm].
For the test files, I take a majority vote over the network's outputs for all of the file's 13-coefficient vectors.
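To make the voting step concrete, here is a minimal sketch of it. The function name `majority_vote` and the example predictions are made up for illustration; the real inputs would be the network's per-vector outputs for one test file.

```python
from collections import Counter

def majority_vote(frame_predictions):
    """Return the speaker label predicted most often across one file's frames."""
    if not frame_predictions:
        raise ValueError("need at least one frame prediction")
    # most_common(1) returns [(label, count)]; on a tie, the label seen first wins
    label, _count = Counter(frame_predictions).most_common(1)[0]
    return label

# Hypothetical per-frame outputs for one test file (speaker ids 0, 1, 2)
predictions = [0, 2, 0, 1, 0, 0, 2]
print(majority_vote(predictions))  # 0
```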

I usually get about an 85% recognition rate for 3 speakers, which is not amazing, so I decided to add some features, but I don't know what to add...

Does anyone have recommendations for which features I should add, or what else I should do to increase my recognition rate?

I tried a module called "pitch", which gives the pitch of a WAV file, but it gave me very erratic values (for example, for the same speaker it gave 360, 80, and 440 for the first 3 audio files).

Thanks a lot for any help.

Solution

You should process longer chunks at once; from 0.15 seconds of audio it is almost impossible to identify a speaker.

The general rule is that the longer the audio you process, the more accurate the recognition will be. Something like 1-3 seconds is good, and you need to feed each such chunk to the neural network as a whole.
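As a rough sketch of the chunking, the audio can be cut into fixed-length segments before feature extraction. The function name, the 2-second segment length, and the 1-second hop are assumptions picked from the 1-3 second range, not tuned values; each segment would then be turned into its full sequence of MFCC frames and passed to the network as one input.

```python
def split_into_segments(samples, sample_rate, segment_seconds=2.0, hop_seconds=1.0):
    """Cut a 1-D sequence of audio samples into fixed-length, overlapping segments."""
    segment_len = int(segment_seconds * sample_rate)
    hop_len = int(hop_seconds * sample_rate)
    if segment_len <= 0 or hop_len <= 0:
        raise ValueError("segment and hop lengths must be positive")
    segments = []
    # Stop early enough that every segment is full length (no padding)
    for start in range(0, len(samples) - segment_len + 1, hop_len):
        segments.append(samples[start:start + segment_len])
    return segments

# 5 seconds of silent dummy audio at 16 kHz, standing in for a real recording
sample_rate = 16000
audio = [0.0] * (5 * sample_rate)
segments = split_into_segments(audio, sample_rate)
print(len(segments), len(segments[0]))  # 4 32000
```

Audio shorter than one segment yields an empty list here; padding it instead would be another reasonable choice.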

You can search GitHub for x-vector; there are many implementations, and you can find one in Kaldi, for example.
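For orientation, x-vector systems turn a variable number of frame-level vectors into one fixed-size vector with a statistics pooling step: the mean and standard deviation of each feature dimension over time. The sketch below shows only that pooling idea on plain lists; the function name is made up, and in a real x-vector network the pooling is applied to learned frame-level activations, not to raw MFCCs.

```python
from statistics import fmean, pstdev

def statistics_pooling(frames):
    """Collapse equal-length frame vectors into one vector: [means..., stds...]."""
    if not frames:
        raise ValueError("need at least one frame")
    columns = list(zip(*frames))  # one tuple per feature dimension
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population std over the frames
    return means + stds

# Two toy frames with 2 features each; a real segment would have many frames
frames = [[1.0, 10.0], [3.0, 10.0]]
print(statistics_pooling(frames))  # [2.0, 10.0, 1.0, 0.0]
```

The output length depends only on the feature dimension, not on how many frames the segment has, which is what lets the network treat a whole 1-3 second chunk as a single input.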