java, speech-recognition, speech, speech-to-text

How to set up a Speech Recognition Server?


How can I implement speech recognition on the server side (please don't suggest HTML5's x-webkit-speech, JavaScript, etc.)? The program will take an audio file as input and produce a text transcription of it with sufficient accuracy. What options do I have?

I have tried Sphinx4 with the Voxforge model, but the accuracy is very poor (there may also be a problem in my configuration; I am still learning it). In one post I read that when we use <input name="speech" id="speech" type="text" x-webkit-speech /> the input is sent to an external server, and that server then does the recognition and sends the result back to the browser.

How can I set up such a server? An existing open-source server would also be useful if it can recognize English sentences with a low error rate.


Solution

  • You have several problems to solve: 1. How to capture audio on the client. 2. How to transfer that audio to a server. 3. How to perform the recognition. 4. How to transfer the recognition result and confidence score back. 5. What to do with the recognition result and confidence score (your application).

    For the first, you can use Google's approach, where the user clicks a microphone icon and records their voice for a while, or the iPhone Siri approach, where voice activity detection (VAD) decides when to record audio.

    Second, this is basically a TCP/IP file transfer problem. You can also follow the Apple/Google approach and compress the audio file with FLAC or Speex.

    Third, this is the really hard part. You need much better acoustic models than the ones you can get from Voxforge. This is especially true for continuous, context-free speech recognition like Siri. For commands, Voxforge is fine.

    Fourth, it is another file transfer problem.

    Fifth, that is your application.

    The hard part is the speech recognition itself. Another problem may be how to scale this to thousands of users. You can use Julius speech recognition as a speech client to capture audio. We can chat more about this problem privately.
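
To make steps 2 to 4 concrete, here is a minimal sketch of the server side in Java. It uses only the JDK's built-in com.sun.net.httpserver package. Everything else is an assumption made for illustration: the names Recognizer, Hypothesis and SpeechServerSketch, the /recognize path, and the JSON reply shape are all hypothetical. The actual decoder (Sphinx4, Julius, or anything else) would be plugged in behind the Recognizer interface, and that is still the hard part.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical plug-in point for the real decoder (step 3). */
interface Recognizer {
    Hypothesis recognize(byte[] audio);
}

/** A transcription plus its confidence score, assumed to lie in [0, 1]. */
final class Hypothesis {
    final String text;
    final double confidence;

    Hypothesis(String text, double confidence) {
        this.text = text;
        this.confidence = confidence;
    }
}

final class SpeechServerSketch {

    /** Step 4: serialize the result. Only backslashes and quotes are escaped here. */
    static String toJson(Hypothesis h) {
        String escaped = h.text.replace("\\", "\\\\").replace("\"", "\\\"");
        return String.format(Locale.ROOT,
                "{\"text\":\"%s\",\"confidence\":%.2f}", escaped, h.confidence);
    }

    /** Starts an HTTP server that accepts raw audio bytes via POST /recognize. */
    static HttpServer start(int port, Recognizer recognizer) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/recognize", exchange -> handle(exchange, recognizer));
        server.start();
        return server;
    }

    private static void handle(HttpExchange exchange, Recognizer recognizer)
            throws IOException {
        try {
            if (!"POST".equals(exchange.getRequestMethod())) {
                exchange.sendResponseHeaders(405, -1); // -1: no response body
                return;
            }
            // Step 2: the uploaded audio file is simply the request body.
            byte[] audio = exchange.getRequestBody().readAllBytes();
            // Step 3: delegate to whatever recognizer was plugged in.
            byte[] body = toJson(recognizer.recognize(audio))
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders()
                    .set("Content-Type", "application/json; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        } finally {
            exchange.close();
        }
    }
}
```

For example, calling SpeechServerSketch.start(8080, audio -> new Hypothesis("stub", 0.0)) would let a client upload a file with something like `curl --data-binary @utterance.flac http://localhost:8080/recognize` (the file name is made up). A stub recognizer like this is only useful for testing the transfer path. Scaling to thousands of users would need more than this single-process sketch.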