I'm working in speech signal processing and I want to detect and time-tag vowels and consonants in an audio file.
I'd like something like this (just an example, I'm not sure how it would work):
For the word "Done": D [0-3 ms], o [4-7 ms], n [8-11 ms], and e [12-13 ms].
I think I'm facing some kind of classification problem. I thought about using Support Vector Machines, Hidden Markov Models, or Recurrent Neural Networks.
Any suggestions on how I should do the vowel/consonant detection and the time tagging?
Probably I'll use MATLAB. What do you think?
Thank you.
If you prefer using HMMs, my suggestion is HTK (the Hidden Markov Model Toolkit). There is a precise, detailed tutorial for it, but the toolkit is in C (ANSI). With HMMs you must train the models first (supervised), so of course you need training examples that are completely labeled with phonemes/tags. What you then do is called phoneme recognition: you use the trained models to recognize the phonemes in other voices/test audio. Once HTK has recognized the phonemes, the duration of each recognized phoneme/tag can be calculated from the recognition output.
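To make the last step concrete, here is a minimal sketch in Python (not MATLAB, and not part of HTK) of how the durations and the vowel/consonant tags could be computed once you have the recognizer's label output. It assumes the output follows the HTK label-file layout, with one segment per line as "start end label" and times given in units of 100 ns. The function names, the sample lines, and the ARPAbet-style vowel list are my own assumptions for illustration; adapt them to whatever phone set you actually train with.

```python
# Sketch: turn HTK-style label lines into time-tagged vowel/consonant segments.
# Assumption: each segment line is "start end label [score]" with times in 100 ns units.

HTK_UNITS_PER_MS = 10_000  # 1 ms = 10,000 units of 100 ns

# Assumed ARPAbet-style vowel labels (illustrative, not tied to any specific model set).
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er", "ey",
          "ih", "iy", "ow", "oy", "uh", "uw"}
# Assumed names for silence / short-pause models.
SILENCE = {"sil", "sp"}


def parse_label_lines(lines):
    """Return a list of (label, start_ms, end_ms) from label-file lines.

    Lines that do not start with two integer time stamps (headers,
    file names, the terminating '.') are skipped.
    """
    segments = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            start, end = int(fields[0]), int(fields[1])
        except ValueError:
            continue
        segments.append((fields[2], start / HTK_UNITS_PER_MS, end / HTK_UNITS_PER_MS))
    return segments


def tag_segments(segments):
    """Attach a vowel/consonant/silence tag and a duration to each segment."""
    tagged = []
    for label, start_ms, end_ms in segments:
        if label in SILENCE:
            kind = "silence"
        elif label in VOWELS:
            kind = "vowel"
        else:
            kind = "consonant"
        tagged.append({
            "label": label,
            "kind": kind,
            "start_ms": start_ms,
            "end_ms": end_ms,
            "duration_ms": end_ms - start_ms,
        })
    return tagged


if __name__ == "__main__":
    # Made-up recognition output for the word "done" (d ah n), for illustration only.
    sample = [
        "0 900000 sil",
        "900000 1500000 d",
        "1500000 2700000 ah",
        "2700000 3600000 n",
    ]
    for seg in tag_segments(parse_label_lines(sample)):
        print(f'{seg["label"]:>3} {seg["kind"]:<9} '
              f'[{seg["start_ms"]:.0f}-{seg["end_ms"]:.0f} ms] '
              f'duration {seg["duration_ms"]:.0f} ms')
```

The vowel/consonant decision here is just a lookup on the recognized phoneme label, so the hard part stays in the HMM training and recognition; the time tagging falls out of the start and end stamps for free.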