
Identify start/stop times of spoken words within a phrase using Sphinx


I'm trying to identify the start/end time of individual words within a phrase. I have a WAV file of the phrase AND the text of the utterance.

Is there an intelligent way of combining these two inputs (audio, text) to improve Sphinx's recognition abilities? What I'd like as output are accurate start/stop times for each word within the phrase.

(I know you can pass -time yes to pocketsphinx to get the time data I'm looking for -- however, the speech recognition itself is not very accurate.)

The solution cannot be for a specific speaker, as the corpus I'm working with contains a lot of different speakers, although they are all using US English.


Solution

  • We have a specific tool for that: the audio aligner in sphinx4. See

    http://cmusphinx.sourceforge.net/2014/07/long-audio-aligner-landed-in-trunk/
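To make the idea of combining the known text with timed recognizer output concrete, here is a rough sketch. It is not the sphinx4 aligner and does not do forced alignment against the audio; it only lines up the known transcript with word timings you already have (for example, ones parsed from the -time yes output) using Python's standard difflib. The function name and the (word, start, end) tuple layout are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def transfer_timings(transcript, hypothesis):
    """Map known transcript words onto timed recognizer words.

    transcript: list of words known to have been spoken.
    hypothesis: list of (word, start_sec, end_sec) tuples from a recognizer
                (hypothetical layout, adapt it to your own parser).
    Returns a list of (word, start, end); start and end are None wherever
    the recognizer output contains no matching word.
    """
    ref = [w.lower() for w in transcript]
    hyp = [w.lower() for w, _, _ in hypothesis]

    # Start with no timing for any transcript word.
    result = [(w, None, None) for w in transcript]

    # Find runs of words that appear in the same order in both sequences.
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = hypothesis[block.b + k]
            result[block.a + k] = (transcript[block.a + k], start, end)
    return result


if __name__ == "__main__":
    words = ["the", "quick", "brown", "fox"]
    timed = [("the", 0.1, 0.3), ("quit", 0.3, 0.6),
             ("brown", 0.6, 0.9), ("fox", 0.9, 1.3)]
    for entry in transfer_timings(words, timed):
        print(entry)
```

Words the recognizer got wrong (here "quick", heard as "quit") come back without times, so this only helps when most of the hypothesis is already correct. A real aligner, such as the sphinx4 tool linked above, constrains the search with the transcript and should give timings for every word.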