c++ signal-processing fft voice-recognition libav

Voice recognition based on signal/spectrum analysis

I'm working on a solution to recognize a spoken word by comparing its signal and spectrum against a reference.
To decode the audio I use libavcodec and libavformat. I take one reference word and compare it to another recording.
Example:

# Must return true
./vrecog --file_ref chocolat.wav --file_cmp chocolat_2.wav
# Must return false
./vrecog --file_ref chocolat.wav --file_cmp banana.wav

My steps:

1. I put the signal in a std::vector
2. I transform the signal into a spectrum with a Fast Fourier Transform
3. I calculate the [min, max, average, std_deviation, variance] of the spectrum
4. I use the values from step 3 to calculate a correlation coefficient
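
For concreteness, steps 2 and 3 above could be sketched roughly as follows. This is only an illustration, not the original code: the names `magnitude_spectrum`, `SpectrumStats` and `summarize` are hypothetical, the variance is assumed to be the population variance, and a naive O(N^2) DFT stands in for whatever FFT library is actually used.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper: magnitude of the first N/2 + 1 DFT bins of a real signal.
// A naive O(N^2) DFT is used here only to keep the sketch self-contained;
// a real program would call an FFT library instead.
std::vector<double> magnitude_spectrum(const std::vector<double>& signal) {
    assert(!signal.empty());
    const std::size_t n = signal.size();
    const double pi = std::acos(-1.0);
    std::vector<double> mag(n / 2 + 1);
    for (std::size_t k = 0; k < mag.size(); ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                -2.0 * pi * static_cast<double>(k * t) / static_cast<double>(n);
            acc += signal[t] * std::polar(1.0, angle);
        }
        mag[k] = std::abs(acc);
    }
    return mag;
}

// Hypothetical container for the five values listed in step 3.
struct SpectrumStats {
    double min;
    double max;
    double mean;
    double variance;       // population variance (an assumption)
    double std_deviation;
};

SpectrumStats summarize(const std::vector<double>& values) {
    assert(!values.empty());
    SpectrumStats s{};
    s.min = *std::min_element(values.begin(), values.end());
    s.max = *std::max_element(values.begin(), values.end());

    double sum = 0.0;
    for (double v : values) sum += v;
    s.mean = sum / static_cast<double>(values.size());

    double sq = 0.0;
    for (double v : values) sq += (v - s.mean) * (v - s.mean);
    s.variance = sq / static_cast<double>(values.size());
    s.std_deviation = std::sqrt(s.variance);
    return s;
}
```

Calling `summarize(magnitude_spectrum(samples))` on each decoded file would give the five numbers that step 4 then compares.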

Is this reasoning correct? The coefficient is always near 1, and I don't know what I can use to compare the data efficiently and decide whether the words are the same or not.

These are my plots:
Signals (chocolat, chocolat_2 and banana):
Spectrum (chocolat, chocolat_2 and banana):

We can easily see that the signals and spectra look similar for both "chocolat" recordings, but I'm not able to get a percentage of similarity.

Solution

For signals, this is typically done with the cross-correlation function of the two signals, which is very similar to convolution. As such, it can be computed via the FFT, which is specifically designed to be efficient. Once you have the correlation function, you can decide what threshold you want to count as a "match". For more information, I'd read http://www.aip.de/groups/soe/local/numres/bookcpdf/c13-2.pdf, since this topic is pretty math heavy and was taught over a few weeks in one of my college courses.
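
As a rough illustration of the idea, here is a minimal sketch of a normalized cross-correlation whose peak is compared against a threshold. The helper names `normalize` and `peak_cross_correlation` are made up for this example, and the correlation is evaluated directly in O(N*M) rather than through the FFT, purely to keep the code short; for long recordings the FFT route described above would be the faster choice. Both signals are assumed to be mono and sampled at the same rate.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: remove the mean and scale to unit energy, so that
// every correlation value lies in [-1, 1] (Cauchy-Schwarz).
std::vector<double> normalize(const std::vector<double>& x) {
    assert(!x.empty());
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= static_cast<double>(x.size());

    std::vector<double> out(x.size());
    double energy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = x[i] - mean;
        energy += out[i] * out[i];
    }
    const double norm = std::sqrt(energy);
    if (norm > 0.0) {
        for (double& v : out) v /= norm;
    }
    return out;
}

// Hypothetical helper: largest value of the normalized cross-correlation
// over all relative shifts (lags) of the two signals.
double peak_cross_correlation(const std::vector<double>& ref,
                              const std::vector<double>& cmp) {
    const std::vector<double> a = normalize(ref);
    const std::vector<double> b = normalize(cmp);
    const long n = static_cast<long>(a.size());
    const long m = static_cast<long>(b.size());

    double best = -1.0;
    for (long lag = -(m - 1); lag < n; ++lag) {
        double sum = 0.0;
        for (long j = 0; j < m; ++j) {
            const long i = j + lag;  // index into a that lines up with b[j]
            if (i >= 0 && i < n) {
                sum += a[static_cast<std::size_t>(i)] *
                       b[static_cast<std::size_t>(j)];
            }
        }
        best = std::max(best, sum);
    }
    return best;
}
```

A comparison would then look like `peak_cross_correlation(ref, cmp) >= threshold`, where the threshold is something you have to tune on your own recordings; no particular value is implied here.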