
Speech recognition, such as Siri


Software such as Siri takes voice commands and responds to those questions appropriately (about 98% of the time). I want to know: when we write software that takes an input stream of voice signal and responds to those questions,

Do we need to convert the input into a human-readable language, such as English?

In nature we have many different languages, but when we speak, we basically make different noises. That's it. We have merely created the so-called alphabet to denote those noise variations.

So, again, my question is: when we write speech recognition algorithms, do we match those noise-variation signals directly against our database, or do we first convert the noise variations into English and then look up the answer in the database?


Solution

  • The "noise variation signals" you are referring to are called phonemes. How a speech recognition system translates these phonemes into words depends on the type of system. Siri is not a grammar-based system, where you tell the speech recognizer what types of phrases to expect based on a set of rules. Since Siri translates speech in an open context, it probably uses some type of statistical modeling. A popular statistical model for speech recognition today is the Hidden Markov Model. While there is a database of sorts involved, it is not a simple lookup from groups of phonemes to words. There is a pretty good high-level description of the process and the issues with translation here.
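To make the Hidden Markov Model idea concrete, here is a minimal sketch of Viterbi decoding: given a sequence of noisy acoustic labels, find the most probable underlying phoneme sequence. All states, labels, and probabilities below are invented for illustration (a toy model for the phonemes of "cat"); real systems learn these parameters from large speech corpora.

```python
# Toy HMM decoder. Hidden states are phonemes; observations are noisy
# acoustic labels. Every probability here is made up for illustration.

states = ["k", "ae", "t"]  # hypothetical phoneme inventory for "cat"

start_p = {"k": 0.8, "ae": 0.1, "t": 0.1}      # P(first phoneme)
trans_p = {                                     # P(next phoneme | current)
    "k":  {"k": 0.1, "ae": 0.8, "t": 0.1},
    "ae": {"k": 0.1, "ae": 0.1, "t": 0.8},
    "t":  {"k": 0.3, "ae": 0.3, "t": 0.4},
}
emit_p = {                                      # P(acoustic label | phoneme)
    "k":  {"K1": 0.7, "A1": 0.2, "T1": 0.1},
    "ae": {"K1": 0.1, "A1": 0.8, "T1": 0.1},
    "t":  {"K1": 0.1, "A1": 0.2, "T1": 0.7},
}

def viterbi(obs):
    """Return (best phoneme path, its probability) for the observations."""
    # Each table entry maps a state to (best probability so far, best path).
    table = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        prev = table
        table = {}
        for s in states:
            prob, path = max(
                (prev[r][0] * trans_p[r][s] * emit_p[s][o], prev[r][1])
                for r in states
            )
            table[s] = (prob, path + [s])
    prob, path = max(table.values())
    return path, prob

path, prob = viterbi(["K1", "A1", "T1"])
print(path)  # -> ['k', 'ae', 't']
```

The decoder never "converts to English" mid-stream: it scores whole phoneme sequences jointly, which is why statistical recognizers can recover the right words even when individual sounds are ambiguous.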