Hi Recognition Experts,
I have a lot of MP3 files (the original audio stream sample rate was 11.025 kHz) containing spoken digits (0-9).
Different speakers (male and female) say, for example, "One", "Seven", "Three", etc., with pauses of about 2 to 2.5 seconds between them.
I'm going to use CMU Sphinx to recognize the speech (desktop application), so I have some questions:
MP3 decoding: How should I decode my MP3 files, i.e. what sample rate should I specify to ffmpeg? (As far as I know, upsampling or downsampling a stream is not recommended.) Should I filter out noise and/or frequency bands while decoding?
Acoustic models: If I don't resample the stream, how can I find an acoustic model that supports 11.025 kHz? If I do resample, what is the best model for digits?
Recognition mode: I found there are two modes for transcribing: keyword spotting and recognition. Which mode would be better, given that I have only digits (and some noise)?
Thanks
UPD:
Nikolay, thank you for the answer. I tried what you proposed, and it works!
If you don't mind, I'd like to ask some additional questions:
I found that one of the VoxForge acoustic models is more accurate than en-us-8khz. Is that OK?
Only 45% of the files are recognized correctly; the other 55% have 20-90% errors. So my question is: is there a way to estimate the confidence of the results? For example, could I skip the files that are not recognized with confidence?
If the answer to the previous question is "no", what can you suggest to improve the accuracy? I know the question is very abstract...
Thank you in advance!
UPD2:
By the way, the best parameter set (found by simply trying various parameter values) is:
-remove_dc yes -remove_noise no -vad_threshold 3.4 -vad_prespeech 19 -vad_postspeech 37 -silprob 2.5
MP3 decoding: How should I decode my MP3 files, i.e. what sample rate should I specify to ffmpeg? (As far as I know, upsampling or downsampling a stream is not recommended.) Should I filter out noise and/or frequency bands while decoding?
ffmpeg -i file.mp3 -ar 8000 file.wav
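Since the question mentions many files, the same conversion can be applied to a whole directory. The following is only a sketch: the function name convert_dir and the FFMPEG override variable are made up for illustration, and the extra flags -ac 1 and -sample_fmt s16 are an assumption that the decoder should be given mono, 16-bit PCM input.

```shell
#!/bin/sh
# Convert every MP3 in a directory to an 8 kHz WAV file next to it.
# convert_dir and FFMPEG are illustrative names, not part of ffmpeg.
convert_dir() {
    dir=${1:-.}
    for f in "$dir"/*.mp3; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        # -ar 8000: resample to 8 kHz (as in the command above)
        # -ac 1 / -sample_fmt s16: mono, 16-bit samples (assumption)
        "${FFMPEG:-ffmpeg}" -y -i "$f" -ar 8000 -ac 1 -sample_fmt s16 "${f%.mp3}.wav"
    done
}

convert_dir "${1:-.}"
```

Saved as a script, it takes the directory as its only argument and defaults to the current directory; -y overwrites existing WAV files without asking.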
Acoustic models: If I don't resample the stream, how can I find an acoustic model that supports 11.025 kHz? If I do resample, what is the best model for digits?
en-us-8khz is available in the downloads. You need to create a digits grammar as in the tutorial and then use it in the following way:
pocketsphinx_continuous -infile file.wav -jsgf digits.gram -hmm en-us-8khz -samprate 8000
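For reference, a digits grammar in JSGF could look roughly like the one written by the snippet below. This is a minimal sketch, not the grammar from the tutorial: the rule name digits and the word list are assumptions, and each word has to appear with the same spelling in the pronunciation dictionary used with the model.

```shell
#!/bin/sh
# Write a JSGF grammar that accepts one or more spoken digits.
# The quoted 'EOF' keeps the shell from expanding anything in the body.
cat > digits.gram <<'EOF'
#JSGF V1.0;

grammar digits;

public <digits> = ( zero | one | two | three | four | five | six | seven | eight | nine )+;
EOF
```

The resulting digits.gram is the file passed to -jsgf in the command above. The trailing + lets one utterance contain several digits, which should cover both a single digit per utterance and digits spoken close together.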
Recognition mode: I found there are two modes for transcribing: keyword spotting and recognition. Which mode would be better, given that I have only digits (and some noise)?
Use recognition mode.