Tags: speech-recognition, voice-recognition, cmusphinx, pocketsphinx

Digits recognition with CMU Sphinx


Hi Recognition Experts,

I have a lot of MP3 files (the original audio stream sample rate was 11.025 kHz) containing spoken digits (0-9).

Different speakers (male and female) say, for example, "One", "Seven", "Three", and so on, with pauses of about 2 to 2.5 seconds between digits.

I'm going to use CMU Sphinx to recognize the speech in a desktop application, so I have some questions:

  1. MP3 decoding: How do I decode my MP3 files, that is, what sample rate should I specify to ffmpeg (as far as I know, upsampling or downsampling streams is not recommended)? Should I filter noise and/or frequency bands while decoding?

  2. Acoustic models: If I don't upsample/downsample the stream, how can I find an acoustic model supporting 11025 Hz? If I do, what is the best model for digits?

  3. Recognition mode: I found there are two modes for transcribing: keyword spotting and recognition. Which mode would be better, given that I have only digits (and some noise)?

Thanks

UPD:

Nikolay, thank you for the answer. I've tried what you proposed, and it works!

If you don't mind, I'd like to ask some additional questions:

  1. I found that one of the VoxForge acoustic models is more accurate than en-us-8khz. Is it OK to use it instead?

  2. Only 45% of the files are recognized correctly. The other 55% have 20-90% mistakes. Thus my question: is it possible to estimate the confidence of the results? For example, could I skip the files that are not recognized with enough confidence?

  3. If the answer to question 2 is "no", what can you suggest to improve the accuracy? I know the question is very abstract...

Thank you in advance!

UPD2:

By the way, the best parameter set I found (simply by trying various values) is:

-remove_dc yes -remove_noise no -vad_threshold 3.4 -vad_prespeech 19 -vad_postspeech 37 -silprob 2.5

Solution

  • MP3 decoding: How do I decode my MP3 files, that is, what sample rate should I specify to ffmpeg (as far as I know, upsampling or downsampling streams is not recommended)? Should I filter noise and/or frequency bands while decoding?

     ffmpeg -i file.mp3 -ar 8000 file.wav
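
Since there are many files, the same conversion can be run in a loop. The following is only a sketch under some assumptions that are not part of the original answer: the helper names `wav_name` and `convert_all` are made up for illustration, the MP3 files are assumed to sit in one directory, and the extra options `-ac 1` and `-c:a pcm_s16le` are added on the assumption that the recognizer expects mono 16-bit PCM input.

```shell
#!/bin/sh
# Sketch: convert every MP3 in a directory to 8 kHz mono 16-bit WAV.
# wav_name and convert_all are hypothetical helper names.

# Derive the output name by swapping the .mp3 suffix for .wav.
wav_name() {
    printf '%s\n' "${1%.mp3}.wav"
}

convert_all() {
    for f in "$1"/*.mp3; do
        # Skip the unexpanded pattern when the directory has no MP3 files.
        [ -e "$f" ] || continue
        # -ar 8000: resample to 8 kHz (as in the answer above)
        # -ac 1, -c:a pcm_s16le: mono, 16-bit PCM (an assumption, see lead-in)
        ffmpeg -nostdin -y -i "$f" -ar 8000 -ac 1 -c:a pcm_s16le "$(wav_name "$f")"
    done
}

# Convert the directory given as the first argument, or the current one.
convert_all "${1:-.}"
```

Each `name.mp3` ends up next to a `name.wav` that can be passed to the recognizer.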
    

    Acoustic models: If I don't upsample/downsample the stream, how can I find an acoustic model supporting 11025 Hz? If I do, what is the best model for digits?

    The en-us-8khz model is available in the downloads. Create a digits grammar as described in the tutorial, then use it as follows:

     pocketsphinx_continuous -infile file.wav -jsgf digits.gram -hmm en-us-8khz -samprate 8000
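
For reference, a digits grammar for the command above might look like the sketch below. The grammar contents are an assumption (the tutorial's own example may differ), and the words must be spelled the way they appear in the dictionary that ships with the model. The loop at the end simply repeats the command from the answer for every WAV file in the current directory and saves each result to a text file; the `.txt` naming is made up for illustration.

```shell
#!/bin/sh
# Sketch: write a JSGF grammar that accepts one or more spoken digits.
# The grammar name and rule name are illustrative choices.
cat > digits.gram <<'EOF'
#JSGF V1.0;

grammar digits;

public <digits> = ( zero | one | two | three | four | five | six | seven | eight | nine )+;
EOF

# Run the recognizer from the answer on every WAV file in this directory.
for wav in ./*.wav; do
    # Skip the unexpanded pattern when there are no WAV files.
    [ -e "$wav" ] || continue
    # Hypotheses go to stdout; log output on stderr is discarded.
    pocketsphinx_continuous -infile "$wav" -jsgf digits.gram \
        -hmm en-us-8khz -samprate 8000 > "${wav%.wav}.txt" 2>/dev/null
done
```

The trailing `+` lets a single utterance contain several digits, which should not hurt even if the long pauses cause each digit to be segmented on its own.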
    

    Recognition mode: I found there are two modes for transcribing: keyword spotting and recognition. Which mode would be better, given that I have only digits (and some noise)?

    Use recognition mode.