How do I compare two spectrograms and score their similarity? And how do I pick the overall model/approach?
I convert the phone recordings from .m4a to .wav and then plot the spectrogram in Python. The recordings have the same length, so the data can be represented in the same dimensional space. I filtered the signal with a Butterworth bandpass filter (cutoff frequencies 400 Hz and 3500 Hz):
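A minimal sketch of that step with SciPy (the file name, filter order, and FFT sizes here are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt, spectrogram

# Load the .wav (converted from .m4a beforehand, e.g. with ffmpeg).
fs, data = wavfile.read("recording.wav")   # placeholder file name
if data.ndim > 1:
    data = data.mean(axis=1)               # mix stereo down to mono
data = data.astype(np.float64)

# 4th-order Butterworth bandpass, 400-3500 Hz, as second-order sections
# for numerical stability; sosfiltfilt avoids phase distortion.
sos = butter(4, [400, 3500], btype="bandpass", fs=fs, output="sos")
filtered = sosfiltfilt(sos, data)

# Spectrogram of the filtered signal, in dB.
f, t, Sxx = spectrogram(filtered, fs=fs, nperseg=1024, noverlap=512)
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-10), shading="gouraud")
plt.ylabel("Frequency [Hz]")
plt.xlabel("Time [s]")
plt.show()
```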
To find the region of interest I filtered by color using OpenCV, but that would make every clip a different length, which I don't want:
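A sketch of that ROI step with OpenCV; the HSV bounds below are assumptions that depend entirely on the colormap the spectrogram was rendered with:

```python
import cv2
import numpy as np

# Load the rendered spectrogram image (placeholder file name).
img = cv2.imread("spectrogram.png")
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Keep only the "hot" (high-energy) colors of the colormap.
# These bounds are illustrative and must be tuned to your colormap.
lower = np.array([0, 100, 100])
upper = np.array([40, 255, 255])
mask = cv2.inRange(hsv, lower, upper)

# Bounding box of all matching pixels = region of interest.
pts = cv2.findNonZero(mask)                # None if the mask is empty
x, y, w, h = cv2.boundingRect(pts)
roi = img[y:y + h, x:x + w]
```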
Embedding the spectrograms as multidimensional points and scoring each sample by its distance to the most accurate one would be easy to visualise, thanks to dimensionality reduction, in some cluster-like space. But that seems too plain: it involves no training, which makes it hard to verify. How can I use a convolutional neural network, or a combination of a convolutional neural network and a time-delay neural network, to embed the spectrograms as multidimensional points and compare the network outputs instead?
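To make the CNN-embedding idea concrete, here is a minimal untrained Keras encoder; the input shape, layer sizes, and 64-D embedding are illustrative assumptions, and it would still need to be trained (e.g. with a triplet or contrastive loss over pairs of recordings) before its distances mean anything:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Small CNN mapping a (128, 128, 1) spectrogram to a 64-D embedding.
inp = layers.Input(shape=(128, 128, 1))
x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
x = layers.GlobalAveragePooling2D()(x)
emb = layers.Dense(64)(x)                  # the embedding vector
encoder = Model(inp, emb)

def embedding_distance(a, b):
    """Euclidean distance between the embeddings of two spectrograms."""
    ea, eb = encoder(a[None, ...]), encoder(b[None, ...])
    return float(tf.norm(ea - eb, axis=1)[0])
```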
I switched to the Mel spectrogram:
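A sketch of the Mel spectrogram computation with librosa (`n_mels` and the `fmin`/`fmax` range matching the bandpass above are illustrative choices):

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("recording.wav", sr=None)   # keep the native rate

# 128-band Mel spectrogram restricted to the bandpassed range, in dB.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,
                                   fmin=400, fmax=3500)
S_db = librosa.power_to_db(S, ref=np.max)

librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.show()
```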
How can I use a pre-trained convolutional neural network such as VGG16 to embed the spectrograms as tensors and compare those? Do I just remove the last fully connected layer and flatten the output instead?
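One way this could look in Keras, dropping the fully connected head with `include_top=False` and using global average pooling as the flattening step (the 224x224 RGB rendering of each spectrogram and cosine similarity as the score are assumptions, not the only option):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# VGG16 without the fully connected head; pooling="avg" turns the last
# convolutional feature map into a single 512-D vector per image.
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

def embed(spec_img):
    """spec_img: a spectrogram rendered as a (224, 224, 3) RGB array."""
    x = preprocess_input(spec_img.astype("float32")[None, ...])
    return model.predict(x)[0]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# score = cosine_similarity(embed(img1), embed(img2))
```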
In my opinion, and according to Yann LeCun, when you target speech recognition with a deep neural network, you have two obligations: the right kind of network, and a lot of training data.
You may try an RNN in TensorFlow, but you will definitely need a lot of training data.
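For illustration only, a minimal sketch of such an RNN in TensorFlow/Keras; the shapes and layer sizes are arbitrary assumptions, and training it into something useful is exactly where the large dataset is needed:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Treat the (Mel) spectrogram as a sequence of frames: each time step
# is one column of 128 mel bins; the LSTM summarises the sequence.
inp = layers.Input(shape=(None, 128))      # variable-length sequences
x = layers.Masking()(inp)                  # ignore zero-padded frames
x = layers.LSTM(64)(x)
emb = layers.Dense(32)(x)                  # fixed-size embedding
rnn_encoder = Model(inp, emb)
rnn_encoder.summary()
```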
If you don't want to (or can't) find or generate a lot of training data, you have to forget about deep learning to solve this...
In that case (forgetting deep learning), you may take a look at how Shazam works, which is based on a fingerprinting algorithm.
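As a toy illustration of the fingerprinting idea (not Shazam's actual algorithm): pick spectrogram peaks, hash nearby peak pairs as (f1, f2, Δt) triples, and score two recordings by how many hashes they share. All thresholds and sizes below are arbitrary assumptions:

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import spectrogram

def fingerprint(signal, fs, fan_out=5):
    """Hash pairs of spectrogram peaks as (f1, f2, dt) triples."""
    f, t, Sxx = spectrogram(signal, fs=fs, nperseg=1024, noverlap=512)
    S = 10 * np.log10(Sxx + 1e-10)

    # Constellation: local maxima that stand out above the mean level.
    peaks = (S == maximum_filter(S, size=20)) & (S > S.mean() + 10)
    fi, ti = np.nonzero(peaks)
    order = np.argsort(ti)                 # sort peaks by time
    fi, ti = fi[order], ti[order]

    # Pair each peak with a few of its forward neighbours.
    hashes = set()
    for i in range(len(ti)):
        for j in range(i + 1, min(i + 1 + fan_out, len(ti))):
            hashes.add((fi[i], fi[j], ti[j] - ti[i]))
    return hashes

def similarity(h1, h2):
    """Fraction of shared hashes, in [0, 1]."""
    return len(h1 & h2) / max(1, min(len(h1), len(h2)))
```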