I'm planning to build a platform to help users improve their pronunciation of specific words: they speak a word, and I check the confidence level returned by the IBM Speech-to-Text API (if it's less than 85% they should try again). Can I use 'word_confidence' in this scenario, or shouldn't it be used this way?
It's worth a try, but I can foresee a number of hurdles.
How are you going to account for accents and dialects? A Southern accent is just as understandable and just as correct as a Midwestern one.
If you only submit single-word audio files for processing, the STT service has no surrounding context to help determine which word was actually said, and homophones are going to be especially tricky.
You have two choices:
1. Word alternatives, requested via the 'word_alternatives_threshold' option, but you would get confidence levels for all the words in all the alternative responses.
2. Keyword match confidence levels, requested via the 'keywords' and 'keywords_threshold' options. This is most likely going to be your best option.
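To illustrate option 2, here is a minimal sketch of the threshold check on the learner's side. It assumes the keyword-spotting response shape documented for the service (each result carries a 'keywords_result' object mapping each spotted keyword to a list of matches with a 'confidence' field); the 'sample' response below is hand-made for illustration, not real API output, and in practice you would get it back from a recognize call made with the 'keywords' and 'keywords_threshold' parameters.

```python
def passes_pronunciation_check(response: dict, word: str,
                               threshold: float = 0.85) -> bool:
    """Return True if any spotted occurrence of `word` in the STT
    keyword-spotting response meets the confidence threshold."""
    for result in response.get("results", []):
        # keywords_result maps each requested keyword to its matches
        for match in result.get("keywords_result", {}).get(word, []):
            if match.get("confidence", 0.0) >= threshold:
                return True
    return False


# Hand-made example response (illustrative shape, not real output):
sample = {
    "results": [
        {
            "final": True,
            "alternatives": [
                {"transcript": "colorado ", "confidence": 0.91}
            ],
            "keywords_result": {
                "colorado": [
                    {
                        "normalized_text": "colorado",
                        "start_time": 0.11,
                        "end_time": 0.73,
                        "confidence": 0.92,
                    }
                ]
            },
        }
    ]
}

print(passes_pronunciation_check(sample, "colorado"))        # True (0.92 >= 0.85)
print(passes_pronunciation_check(sample, "colorado", 0.95))  # False
```

Note that the confidence here still measures how sure the model is that the word was *said*, not how well it was *pronounced*, so the accent caveats above apply in full.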