Tags: speech-recognition, speech-to-text, hidden-markov-models

Hidden Markov models for phoneme recognition in continuous speech


I know how to apply a hidden Markov model (HMM) when I have an isolated phoneme. I'd just create several HMMs (with at least 3 states per model), one for each phoneme, run the forward algorithm on all of them, and see which one yields the greatest probability.
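The isolated-phoneme setup described above can be sketched in plain Python. This is a minimal, hypothetical example: the forward algorithm is run in log space, and the model names, probabilities, and emission scores are made up for illustration; a real system would use Gaussian emission densities over acoustic features.

```python
import math

def forward_log_likelihood(obs_loglik, log_pi, log_A):
    """Forward algorithm in log space: returns log p(O | model).
    obs_loglik[t][s] = log p(o_t | state s), log_pi = log initial
    probabilities, log_A[i][j] = log p(next state j | state i)."""
    n = len(log_pi)
    # alpha[s] = log p(o_1..o_t, state_t = s)
    alpha = [log_pi[s] + obs_loglik[0][s] for s in range(n)]
    for t in range(1, len(obs_loglik)):
        alpha = [
            math.log(sum(math.exp(alpha[i] + log_A[i][j]) for i in range(n)))
            + obs_loglik[t][j]
            for j in range(n)
        ]
    return math.log(sum(math.exp(a) for a in alpha))

# Toy 2-state model with made-up numbers, just to exercise the function;
# an isolated-phoneme classifier would evaluate this once per phoneme
# model and pick the model with the highest log-likelihood.
log_pi = [math.log(0.6), math.log(0.4)]
log_A = [[math.log(0.7), math.log(0.3)],
         [math.log(0.4), math.log(0.6)]]
obs = [[math.log(0.5), math.log(0.5)],   # frame 0 emission log-likelihoods
       [math.log(0.1), math.log(0.9)]]   # frame 1 emission log-likelihoods
ll = forward_log_likelihood(obs, log_pi, log_A)
```

For these two frames the result can be checked by brute-force summation over all state sequences, which is how the function was sanity-checked here.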

But now I have a continuous speech database, phoneme-labeled at each frame (TIMIT). How could I train an HMM so it can recognize phonemes in continuous speech?


Solution

  • In short: for continuous speech recognition, you connect your phoneme models into one large HMM, using auxiliary silence models.

    First of all, you can train models on isolated phonemes and then apply them to continuous speech. For instance, you can chunk your training audio into per-phoneme segments according to the existing frame labels.

    At the recognition step, applying Viterbi decoding (finding the most likely sequence of hidden states) to the combined model is equivalent to recognizing the sequence of phonemes. For more details you can study the corresponding chapter of the HTK Book.

    To train HMMs on continuous data you use a similar trick: concatenate the single-phoneme models into one large model for the whole underlying sentence. The training procedure then finds the best alignment between the model's states and the audio (this is known as embedded training). Again, the HTK Book provides a nice tutorial on this.

    Phoneme Recognition on the TIMIT Database provides a complete overview of the methods, with many references to papers. For instance, this classical article describes a basic method for context-independent phoneme recognition.
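    The connect-and-decode idea above can be sketched in plain Python. This is a toy illustration, not HTK: each (hypothetical) phoneme gets a 3-state left-to-right model, the models are chained into one composite HMM, and Viterbi decoding over made-up emission scores recovers the phoneme sequence from the decoded state path.

    ```python
    import math

    LOG0 = float("-inf")

    def concat_phoneme_models(phonemes, n_states=3, p_stay=0.6):
        """Chain left-to-right phoneme models into one composite HMM.
        Returns (labels, log_A): state s belongs to phoneme labels[s].
        The last state only self-loops in this sketch."""
        labels = [p for p in phonemes for _ in range(n_states)]
        n = len(labels)
        log_A = [[LOG0] * n for _ in range(n)]
        for s in range(n):
            log_A[s][s] = math.log(p_stay)
            if s + 1 < n:
                log_A[s][s + 1] = math.log(1 - p_stay)
        return labels, log_A

    def viterbi(obs_loglik, log_A):
        """Most likely state path; the path starts in state 0."""
        n = len(log_A)
        delta = [LOG0] * n
        delta[0] = obs_loglik[0][0]
        back = []
        for t in range(1, len(obs_loglik)):
            prev, ptr, delta = delta, [0] * n, [LOG0] * n
            for j in range(n):
                best_i, best = 0, LOG0
                for i in (j - 1, j):  # left-to-right: stay or advance by one
                    if i >= 0 and prev[i] + log_A[i][j] > best:
                        best_i, best = i, prev[i] + log_A[i][j]
                delta[j] = best + obs_loglik[t][j]
                ptr[j] = best_i
            back.append(ptr)
        s = max(range(n), key=lambda j: delta[j])  # backtrack from best end
        path = [s]
        for ptr in reversed(back):
            s = ptr[s]
            path.append(s)
        return path[::-1]

    # Hypothetical 3-phoneme utterance; toy emissions where frame t
    # strongly prefers composite state t (one frame per state).
    phonemes = ["s", "ih", "k"]
    labels, log_A = concat_phoneme_models(phonemes)
    T = len(labels)
    obs = [[0.0 if s == t else -10.0 for s in range(T)] for t in range(T)]
    path = viterbi(obs, log_A)
    # Collapse consecutive states belonging to the same phoneme model
    # (adjacent repeats of the same phoneme would merge in this sketch).
    decoded = []
    for s in path:
        if not decoded or decoded[-1] != labels[s]:
            decoded.append(labels[s])
    # decoded == ["s", "ih", "k"]
    ```

    The same composite-model construction is what embedded training exploits: build the sentence-level HMM from the transcription, and let Baum-Welch (rather than hand labels) decide which frames belong to which state.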