speech-recognition, sphinx4

Language Models and Sphinx4


I'm new to Sphinx and I'm trying to write a program that recognizes the word in an audio file containing a single spoken word and then rates the confidence of that recognition. A language model doesn't seem necessary for a project like this, since I'm only trying to recognize one word, but Sphinx seems to need a language model to do anything. Is it possible to do this without one?


Solution

  • Unfortunately, Sphinx (like any other ASR system) needs a language model to do anything. The language model is used during Viterbi decoding, where it is required to assign a score to each of the many candidate transcriptions.

    I assume that each audio file contains one of a set of possible words (I'm not sure what the point would be if every file contained the same word). In that case, you can use a grammar rather than a statistical language model. In general, grammars work well for small-vocabulary tasks.

    Sphinx4 JSGFGrammar Documentation
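
    For a small word set, the grammar can be as simple as one rule listing the alternatives. The following is a minimal sketch that generates such a JSGF file from a word list. The class SingleWordGrammar, the method build, the grammar name and the example words are all hypothetical and not part of Sphinx4; only the JSGF header and rule syntax come from the JSGF format itself. Each word would also need an entry in the recognizer's pronunciation dictionary.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Sphinx4): builds a one-rule JSGF grammar. */
final class SingleWordGrammar {

    /** Returns JSGF text whose public rule accepts exactly one word from the list. */
    static String build(String grammarName, List<String> words) {
        if (words.isEmpty()) {
            throw new IllegalArgumentException("at least one word is required");
        }
        StringBuilder sb = new StringBuilder();
        sb.append("#JSGF V1.0;\n\n");                              // JSGF self-identifying header
        sb.append("grammar ").append(grammarName).append(";\n\n"); // grammar declaration
        sb.append("public <word> = ")                              // the single public rule
          .append(String.join(" | ", words))                       // alternatives separated by |
          .append(";\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example words are placeholders; substitute your own vocabulary.
        System.out.print(build("words", Arrays.asList("yes", "no", "maybe")));
    }
}
```

    The generated text can be saved as a .gram file and loaded through the JSGFGrammar component described in the documentation above.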

    To get the confidence value, see the documentation for the ConfidenceScorer class, which can score Result objects from the recognizer.

    ConfidenceScorer documentation with example
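
    To make the idea of a confidence value concrete, here is a minimal sketch that turns the log-scores of competing hypotheses into normalized posterior probabilities. This illustrates the general concept only and is not how ConfidenceScorer is implemented; the class PosteriorConfidence and its input scores are assumptions made for the example, and the scores are assumed to be natural-log values.

```java
/** Hypothetical illustration (not Sphinx4 code): posteriors from hypothesis log-scores. */
final class PosteriorConfidence {

    /** Normalizes natural-log scores so the resulting probabilities sum to 1. */
    static double[] posteriors(double[] logScores) {
        // Subtract the maximum before exponentiating to avoid underflow,
        // since decoder log-scores are typically large negative numbers.
        double max = Double.NEGATIVE_INFINITY;
        for (double s : logScores) {
            max = Math.max(max, s);
        }
        double sum = 0.0;
        for (double s : logScores) {
            sum += Math.exp(s - max);
        }
        double[] out = new double[logScores.length];
        for (int i = 0; i < logScores.length; i++) {
            out[i] = Math.exp(logScores[i] - max) / sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up scores for three candidate words, best hypothesis first.
        double[] p = posteriors(new double[] {-1200.0, -1205.0, -1230.0});
        for (double v : p) {
            System.out.println(v);
        }
    }
}
```

    A best hypothesis that scores only slightly higher than its competitors gets a posterior well below 1, which is the signal a confidence value is meant to carry.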

    If the audio file can contain any of many unknown words, and you only want to recognize the single word you care about (i.e. you don't know what other words will be in the audio files, or the set is too large to list in your grammar), then you have a pretty difficult task. Honestly, I've worked in speech recognition and I'm not entirely sure how one would do that. You could try adding a bunch of other words with different phonetic characteristics (i.e. different numbers of syllables, different types of sounds) to the grammar, and maybe it would work decently well. If this is the case, let me know and I can come up with some other potential solutions, but my guess is that your task is recognizing one word out of a small set.
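
    If you do try the decoy-word approach, the final decision could look something like the sketch below: accept the result only when the recognizer returns the target word rather than a decoy, and the confidence clears a threshold. The class TargetWordDecision, the example words and the threshold value are all assumptions made for illustration; a suitable threshold would have to be tuned on real recordings.

```java
/** Hypothetical decision step (not Sphinx4 code) for the decoy-word approach. */
final class TargetWordDecision {

    /**
     * Returns true only if the recognized word is the target (case-insensitive)
     * and its confidence is at least the threshold. A decoy word or a
     * low-confidence match is rejected.
     */
    static boolean accept(String recognized, double confidence,
                          String target, double threshold) {
        return recognized != null
                && recognized.equalsIgnoreCase(target)
                && confidence >= threshold;
    }

    public static void main(String[] args) {
        // Placeholder values: "hello" is the target, 0.7 an untuned threshold.
        System.out.println(accept("hello", 0.9, "hello", 0.7));
        System.out.println(accept("banana", 0.95, "hello", 0.7));
    }
}
```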