Quality issue with offline voice-to-text using Sphinx4


I'd like to perform voice recognition on a large number of .wav files that are continually being generated.

There are a growing number of online voice-to-text API services (e.g. Google Cloud Speech, Amazon Lex, Twilio Speech Recognition, Nexmo Voice, etc.) which would work well for connected applications, but aren't suitable for this use case due to cost and bandwidth.

A quick Google search suggested that CMUSphinx (CMU = Carnegie Mellon University) is a popular choice for speech recognition.

I tried the 'hello world' example:

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Main {

    public static void main(String[] args) throws IOException {

        Configuration configuration = new Configuration();

        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
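        // try-with-resources closes the audio stream even if recognition throws.
        try (InputStream stream = new FileInputStream(new File("src/main/resources/test.wav"))) {
            recognizer.startRecognition(stream);
            SpeechResult result;
            // getResult() returns null once the end of the stream is reached.
            while ((result = recognizer.getResult()) != null) {
                System.out.format("Hypothesis: %s%n", result.getHypothesis());
            }
            recognizer.stopRecognition();
        }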

    }
}

The result was slightly disappointing. The 'test.wav' file contains the following audio:

This is the first interval of speaking. After the first moment of silent, this is the second interval of speaking. After the third moment of silence, this the third interval of speaking and the last one.

This was interpreted as:

this is the first interval speaking ... for the first moment of silence is the second of all speaking ... for the for the moment of silence this is the f***ing several speaking in the last

Most of the words have been captured, but the output is garbled to the extent that the meaning is lost. I then downloaded a news story where the enunciation was crystal clear, and the transcription was complete gibberish. It captured as much as a very drunk person would listening to a foreign language.

I'm curious to know whether anyone is using Sphinx4 successfully and, if so, what tweaks were needed to make it work. Are there alternative acoustic/language models, dictionaries, etc. that perform better? Are there any other open source options for offline speech-to-text I should consider?


Solution

  • This turned out to be a trivial issue that's documented in the FAQ: "Q: What is sample rate and how does it affect accuracy"

    [...] we can not detect sample rate yet. So before using decoder you need to make sure that both sample rate of the decoder matches the sample rate of the input audio and the bandwidth of the audio matches the bandwidth that was used to train the model. A mismatch results in very bad accuracy.
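
Because the decoder cannot detect the sample rate itself, it can help to check each file before feeding it to the recognizer. The following is a minimal sketch using only the JDK's `javax.sound.sampled` API. It assumes the model expects 16 kHz, mono, 16-bit signed little-endian PCM, which is what the conversion in this answer targets; the class name `WavFormatCheck` and the method `matchesModel` are hypothetical names chosen for illustration.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;
import java.io.File;
import java.io.IOException;

public class WavFormatCheck {

    // Assumed target format: 16 kHz, mono, 16-bit (adjust if your model differs).
    public static final float EXPECTED_SAMPLE_RATE = 16000f;
    public static final int EXPECTED_CHANNELS = 1;
    public static final int EXPECTED_BITS = 16;

    // Returns true only if the format is 16 kHz mono 16-bit signed little-endian PCM.
    public static boolean matchesModel(AudioFormat format) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(format.getEncoding())
                && format.getSampleRate() == EXPECTED_SAMPLE_RATE
                && format.getChannels() == EXPECTED_CHANNELS
                && format.getSampleSizeInBits() == EXPECTED_BITS
                && !format.isBigEndian();
    }

    public static void main(String[] args) throws IOException, UnsupportedAudioFileException {
        if (args.length != 1) {
            System.err.println("usage: java WavFormatCheck <file.wav>");
            return;
        }
        // Reads only the file header; the audio data itself is not decoded.
        AudioFormat format = AudioSystem.getAudioFileFormat(new File(args[0])).getFormat();
        System.out.println("Detected format: " + format);
        if (!matchesModel(format)) {
            System.out.println("Mismatch: convert to 16 kHz mono before recognition.");
        }
    }
}
```

Running a check like this over newly generated files would flag a 44.1 kHz stereo recording before it reaches the recognizer.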

    The news audio was a BBC recording in stereo, sampled at 44.1 kHz.

    $ soxi GlobalNewsPodcast-20170828-CatastrophicFloodsRisin.wav
    
    Input File     : 'GlobalNewsPodcast-20170828-CatastrophicFloodsRisin.wav'
    Channels       : 2
    Sample Rate    : 44100
    Precision      : 16-bit
    Duration       : 00:29:23.79 = 77783087 samples = 132284 CDDA sectors
    File Size      : 311M
    Bit Rate       : 1.41M
    Sample Encoding: 16-bit Signed Integer PCM
    

    I converted it to mono:

    $ sox GlobalNewsPodcast-20170828-CatastrophicFloodsRisin.wav GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono.wav remix 1,2
    $ soxi GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono.wav
    
    Input File     : 'GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono.wav'
    Channels       : 1
    Sample Rate    : 44100
    Precision      : 16-bit
    Duration       : 00:29:23.79 = 77783087 samples = 132284 CDDA sectors
    File Size      : 156M
    Bit Rate       : 706k
    Sample Encoding: 16-bit Signed Integer PCM
    

    Then I downsampled it to 16 kHz:

    $ sox GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono.wav -r 16k GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono16k.wav
    $ soxi GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono16k.wav
    
    Input File     : 'GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono16k.wav'
    Channels       : 1
    Sample Rate    : 16000
    Precision      : 16-bit
    Duration       : 00:29:23.79 = 28220621 samples ~ 132284 CDDA sectors
    File Size      : 56.4M
    Bit Rate       : 256k
    Sample Encoding: 16-bit Signed Integer PCM
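
Since the .wav files are generated continually, the same mono/16 kHz conversion could also be done from Java rather than by shelling out to sox. The sketch below uses the JDK's built-in `javax.sound.sampled` converters; it is not the method used in this answer. Whether a given conversion is available depends on the Java Sound providers in the JVM, so the code checks `AudioSystem.isConversionSupported` first. The class name `WavTo16kMono` and the method `convert` are hypothetical, and the resampling quality is whatever the JDK provides, which may differ from sox.

```java
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;
import java.io.File;
import java.io.IOException;

public class WavTo16kMono {

    // Assumed target: 16 kHz, 16-bit, mono, signed, little-endian PCM.
    public static final AudioFormat TARGET = new AudioFormat(16000f, 16, 1, true, false);

    // Reads 'in', converts it to TARGET, and writes the result to 'out' as a WAV file.
    public static void convert(File in, File out)
            throws IOException, UnsupportedAudioFileException {
        try (AudioInputStream source = AudioSystem.getAudioInputStream(in)) {
            if (!AudioSystem.isConversionSupported(TARGET, source.getFormat())) {
                throw new IOException("No converter from " + source.getFormat() + " to " + TARGET);
            }
            // The JDK's converter decides how channels are mixed and how audio is resampled.
            try (AudioInputStream converted = AudioSystem.getAudioInputStream(TARGET, source)) {
                AudioSystem.write(converted, AudioFileFormat.Type.WAVE, out);
            }
        }
    }

    public static void main(String[] args) throws IOException, UnsupportedAudioFileException {
        if (args.length != 2) {
            System.err.println("usage: java WavTo16kMono <input.wav> <output.wav>");
            return;
        }
        convert(new File(args[0]), new File(args[1]));
    }
}
```

The converted file can then be passed to the recognizer exactly as in the question's example.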
    

    Now it's working pretty well. Here's a snippet of transcribed audio from the news article:

    emergency officials said they expect the hall from million people to seek assistance in texas bolton flashy thousand people already being cared for in temporary shelter is on the engine is a big on releasing water from two downs that protect houston city sense of ...