Tags: python, audio, convolution, voice-recognition

How to embed fixed-length spectrograms into tensors?


How do I compare two spectrograms and score their similarity? How do I choose the overall model/approach?

I convert recordings from my phone from .m4a to .wav, then plot the spectrogram in Python. The recordings all have the same length, so the data can be represented in the same dimensional space. I filtered the signal with a Butterworth bandpass filter (cutoff frequencies 400 Hz and 3500 Hz):

[Figure: filtered spectrogram]
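For reference, a bandpass filter like that can be built with SciPy. This is a minimal sketch, assuming a mono signal; the file name recording.wav is a placeholder:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt

def bandpass(signal, fs, low=400.0, high=3500.0, order=5):
    # Normalize the cutoff frequencies to the Nyquist frequency
    nyq = 0.5 * fs
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    # filtfilt runs the filter forward and backward, so no phase shift
    return filtfilt(b, a, signal)

fs, data = wavfile.read("recording.wav")  # placeholder file name
filtered = bandpass(data.astype(np.float64), fs)
```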

To find the region of interest, I filtered by color using OpenCV (this makes every clip a different length, which I don't want):

[Figure: color mask]
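A color-threshold mask of that kind is usually done in HSV space. A minimal sketch, where the file name and the HSV bounds are assumptions that depend on the colormap used when plotting:

```python
import cv2
import numpy as np

img = cv2.imread("spectrogram.png")          # placeholder file name
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Keep only the high-energy (here: yellow-ish) pixels; these bounds
# are assumptions tied to the colormap of the plotted spectrogram
lower = np.array([20, 100, 100])
upper = np.array([35, 255, 255])
mask = cv2.inRange(hsv, lower, upper)

# Bounding box of the masked region = region of interest
x, y, w, h = cv2.boundingRect(cv2.findNonZero(mask))
roi = img[y:y + h, x:x + w]
```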

Embedding the spectrograms as multidimensional points and scoring each one by its distance to the most accurate sample would be visualizable, thanks to dimensionality reduction, in some cluster-like space. But that seems too simplistic: it involves no training, which makes it hard to validate. How can I instead use a convolutional neural network, or a combination of a convolutional neural network and a time-delay neural network, to embed the spectrograms as multidimensional points and compare the network outputs?
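For what it's worth, that plain distance baseline is only a few lines. Everything in this sketch (the embedding dimension, the random data, which sample is the reference) is a placeholder:

```python
import numpy as np
from sklearn.decomposition import PCA

# One flattened spectrogram (or network output) per clip -- placeholder data
embeddings = np.random.rand(20, 4096)
reference = embeddings[0]          # the "most accurate" sample

# Score each clip as its Euclidean distance to the reference
scores = np.linalg.norm(embeddings - reference, axis=1)

# Project to 2-D to visualize the cluster-like space
points_2d = PCA(n_components=2).fit_transform(embeddings)
```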

I switched to the Mel spectrogram:

[Figure: Mel spectrogram]
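A Mel spectrogram like that can be computed with librosa. A minimal sketch, where the file name is a placeholder and fmax is matched to the 3500 Hz cutoff above:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("recording.wav", sr=None)   # placeholder file name
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=3500)
S_db = librosa.power_to_db(S, ref=np.max)        # convert power to dB

librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.show()
```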

How can I use a pre-trained convolutional neural network such as VGG16 to embed spectrograms into tensors and compare those? Do I just remove the last fully connected layer and flatten the output instead?
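That is essentially the standard trick: in Keras, `include_top=False` drops the fully connected classifier, and `pooling="avg"` flattens the remaining feature map into a vector. A minimal sketch; the random inputs stand in for spectrograms rendered as 224x224 RGB images:

```python
import numpy as np
from scipy.spatial.distance import cosine
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# Drop the classifier head; global average pooling yields a 512-dim vector
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

def embed(spec_img):
    # spec_img: a spectrogram rendered as a (224, 224, 3) array
    x = preprocess_input(spec_img.astype(np.float32)[np.newaxis])
    return model.predict(x)[0]

# Placeholder inputs: two spectrograms rendered as 224x224 RGB images
spec_a = np.random.rand(224, 224, 3) * 255
spec_b = np.random.rand(224, 224, 3) * 255

# Similarity score via cosine similarity of the two embeddings
similarity = 1.0 - cosine(embed(spec_a), embed(spec_b))
```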


Solution

  • In my opinion, and according to Yann LeCun, when you target speech recognition with deep neural networks you have two obligations:

    • You need to use a Recurrent Neural Network in order to have memory (memory is really important for speech recognition...)

    and

    • you will need a lot of training data

    You may try to use an RNN in TensorFlow (a minimal sketch follows), but you definitely need a lot of training data.
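    As a rough illustration of that route (all shapes and sizes here are assumptions, not recommendations), a small recurrent classifier in Keras that treats each spectrogram as a sequence of time frames could look like:

    ```python
    import tensorflow as tf

    num_frames, num_mels, num_classes = 100, 128, 10   # placeholder sizes

    # Each input: a spectrogram read as a sequence of time frames
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(num_frames, num_mels)),
        tf.keras.layers.LSTM(64),                      # the "memory" part
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()
    ```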

    If you don't want to (or can't) find or generate a lot of training data, you will have to forget about deep learning to solve this ...

    In that case (forgetting deep learning), you may take a look at how Shazam works (it is based on a fingerprinting algorithm).
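    To give the flavor of that idea (a toy sketch, not Shazam's actual algorithm): pick local peaks of the spectrogram as a "constellation map", hash pairs of nearby peaks, and score two clips by how many hashes they share:

    ```python
    import numpy as np
    from scipy.ndimage import maximum_filter

    def fingerprint(spec, neighborhood=20, fanout=10, max_dt=200):
        # Constellation map: keep only local maxima of the spectrogram
        peaks = spec == maximum_filter(spec, size=neighborhood)
        freqs, times = np.nonzero(peaks)
        order = np.argsort(times)
        freqs, times = freqs[order], times[order]
        # Hash pairs of nearby peaks as (f1, f2, time delta) triples
        hashes = set()
        for i in range(len(times)):
            for j in range(i + 1, min(i + 1 + fanout, len(times))):
                dt = times[j] - times[i]
                if 0 < dt <= max_dt:
                    hashes.add((freqs[i], freqs[j], dt))
        return hashes

    def score(spec_a, spec_b):
        # Similarity = overlap of the two fingerprint sets (Jaccard index)
        a, b = fingerprint(spec_a), fingerprint(spec_b)
        return len(a & b) / max(1, len(a | b))
    ```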