voice-recognition, cmusphinx

How can I remove words from the dictionary in CMUSphinx?


I am trying CMUSphinx with Spanish. I downloaded the Spanish model and dictionary, but the accuracy is poor.

I tried removing every word from "es.dict" except the ones I need (about 99% of the entries), and the accuracy went up to 100%.

But this change created a performance problem: I think the system is trying to look up every word in the file "es-20k.lm".

My output shows this for each removed word: "nov 12, 2016 11:05:14 PM edu.cmu.sphinx.linguist.dictionary.TextDictionary getWord INFORMACIÓN: The dictionary is missing a phonetic transcription for the word 'argumento'"

How can I remove the unused words from the Spanish model? Is it possible? I only want to modify this model's dictionary by removing the words I don't use (I only need about 50 words at the moment).

I tried the tools suggested in the documentation, but I don't understand them, or I can't see how to use them for this.

Thanks.


Solution

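  • You should keep the dictionary as it is. The warnings most likely appear because "es-20k.lm" still refers to the words you removed. To restrict the vocabulary, either write a grammar in a text editor or build a language model with SRILM, as advised by the language model tutorial.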

    Overall, reducing the vocabulary is not the only way to improve accuracy. Poor accuracy is usually caused by noise, a mismatch in recording conditions, and other factors, so you need to work on those as well.
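
As a rough illustration of the grammar route, here is a minimal Java sketch that turns a short word list into a JSGF grammar in which the recognizer may hear exactly one of the listed words. The class name `JsgfGrammarSketch`, the method `buildGrammar`, the grammar name `commands`, and the sample words are all made up for this example. Every word in the grammar still needs an entry in the unmodified "es.dict".

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of sphinx4: builds JSGF grammar text from a word list.
class JsgfGrammarSketch {

    // Returns a JSGF grammar whose single public rule accepts exactly one of the given words.
    static String buildGrammar(String grammarName, List<String> words) {
        String alternatives = words.stream()
                .map(String::trim)            // tolerate stray whitespace in the list
                .filter(w -> !w.isEmpty())    // skip blank entries
                .distinct()                   // drop duplicates, keep first-seen order
                .collect(Collectors.joining(" | "));
        if (alternatives.isEmpty()) {
            throw new IllegalArgumentException("at least one word is required");
        }
        return "#JSGF V1.0;\n"
                + "grammar " + grammarName + ";\n"
                + "public <command> = " + alternatives + ";\n";
    }

    public static void main(String[] args) {
        // Sample words only; replace with the ~50 words you actually need.
        List<String> words = Arrays.asList("hola", "gracias", "argumento");
        System.out.print(buildGrammar("commands", words));
    }
}
```

Saved as "commands.gram", the output can be loaded instead of "es-20k.lm". In sphinx4 this is normally done on `edu.cmu.sphinx.api.Configuration` with `setGrammarPath`, `setGrammarName("commands")` and `setUseGrammar(true)`, but check those calls against the version you are using.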