signals signal-processing speech-recognition speech-synthesis

Detect vowels and consonants?

I'm working in the area of speech signal processing, and I want to detect and time-tag vowels and consonants in an audio file.

I'd like something such as this (just an example, I'm not sure how it works):

Using the word Done: D [0-3 ms], o [4-7 ms], n [8-11 ms], and e [12-13 ms].

I think I'm facing a classification problem of some kind. I thought about using Support Vector Machines, Hidden Markov Models, or Recurrent Neural Networks.

Any suggestions on how I should do the vowel/consonant detection and the time tagging?

Probably I'll use MATLAB. What do you think?

Thank you.

Solution

If you prefer HMMs, my suggestion is HTK (the Hidden Markov Model Toolkit). It has a precise, detailed tutorial, but the toolkit is written in ANSI C. With HMMs you must train the models first (supervised), so you need training examples that are completely labeled with phonemes. What you then run on the test audio (for example, other speakers' voices) is called phoneme recognition. Once HTK has recognized the phonemes, the duration of each one can be calculated from the recognition result.
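
To illustrate that last step, here is a minimal Python sketch of how the time tags might be read from an HTK-style label file, where each line is `start end label` and times are in units of 100 ns. The function name `parse_htk_labels`, the ARPAbet-style vowel set, the silence labels, and the example transcription of "done" are all assumptions for illustration; your own phone set and recognizer output will differ.

```python
# Assumed phone sets (ARPAbet-style); replace with the symbols your models use.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}
SILENCE = {"sil", "sp"}

HTK_UNITS_PER_MS = 10_000  # HTK label times are in units of 100 ns


def parse_htk_labels(text):
    """Return (phone, kind, start_ms, end_ms, duration_ms) for each label line."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, MLF headers, file names and the "." terminator.
        if len(fields) < 3 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        start_ms = int(fields[0]) / HTK_UNITS_PER_MS
        end_ms = int(fields[1]) / HTK_UNITS_PER_MS
        phone = fields[2]
        if phone.lower() in SILENCE:
            kind = "silence"
        elif phone.lower() in VOWELS:
            kind = "vowel"
        else:
            kind = "consonant"
        segments.append((phone, kind, start_ms, end_ms, end_ms - start_ms))
    return segments


# Made-up label lines for the word "done" (not real recognizer output).
example = """\
0 600000 d
600000 1800000 ah
1800000 2500000 n
"""

for phone, kind, start, end, dur in parse_htk_labels(example):
    print(f"{phone:3s} {kind:9s} {start:6.1f}-{end:6.1f} ms  ({dur:.1f} ms)")
```

Anything that is neither a listed vowel nor silence is treated as a consonant here, which is a simplification; a real phone set may need a more careful mapping.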