Tags: android, speech-recognition, pocketsphinx

Can I modify PocketSphinx's keyword recognizer "refresh rate"?


I'm running PocketSphinx on Android (version 5prealpha). I'm using a file-defined keyword recognizer, specified by the following snippet (kwfile is the keyword definition file, and mRecognizer is an instance of SpeechRecognizer):

mRecognizer.addKeywordSearch(DESCRIPTOR, kwfile);

Overall, the recognition performance is pretty good, after having optimized the keyword thresholds. However, if I wait some arbitrary amount of time (5 sec up to several minutes) between one keyword utterance and the next, the recognition performance suffers on the second utterance. For example, I'll say "keyword," and it will be recognized. If I wait less than 5 sec and say "keyword" again, the second utterance will likely be recognized (recognition rate over 95%). If, however, I wait 15 sec, the recognition rate drops dramatically, to less than 50%.

My hypothesis is that when I say the keyword the second time, the recognizer is in the middle of a refresh: that is, it is between a Stop Recognition event and a Start Recognition event, and my speech straddles that gap. Here is a typical view of my logcat. Notice that after about 5 sec, the recognizer "refreshes". This happens roughly every 5 sec. Occasionally it can be as long as 30 sec between "refreshes", but generally it's around 5 sec.

09-26 07:11:06.800  20397-20397/...﹕ Start recognition "kwfile"
09-26 07:11:06.815  20397-23642/...﹕ Starting decoding
09-26 07:11:11.310  20397-20397/...﹕ Stop recognition
09-26 07:11:11.315  20397-20397/...﹕ Start recognition "kwfile"
09-26 07:11:11.360  20397-23645/...﹕ Starting decoding
09-26 07:11:17.405  20397-20397/...﹕ Stop recognition
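
As a rough check on that interval, the time between each Start recognition line and the following Stop recognition line can be computed from a dump like the one above. This is only a minimal sketch: the class and method names are hypothetical, and it assumes the logcat layout shown here, with the time of day in the second whitespace-separated field.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: measures how long each recognition session lasted. */
class SessionLengths {
    // Matches the time-of-day field in the log lines above, e.g. 07:11:06.800
    private static final DateTimeFormatter TIME = DateTimeFormatter.ofPattern("HH:mm:ss.SSS");

    /** Returns the milliseconds between each "Start recognition" and the next "Stop recognition". */
    static List<Long> millisBetweenStartAndStop(List<String> lines) {
        List<Long> result = new ArrayList<>();
        LocalTime start = null;
        for (String line : lines) {
            boolean isStart = line.contains("Start recognition");
            boolean isStop = line.contains("Stop recognition");
            if (!isStart && !isStop) {
                continue; // e.g. "Starting decoding" lines are ignored
            }
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 2) {
                continue; // no time field on this line
            }
            LocalTime time = LocalTime.parse(fields[1], TIME);
            if (isStart) {
                start = time;
            } else if (start != null) {
                result.add(Duration.between(start, time).toMillis());
                start = null;
            }
        }
        return result;
    }
}
```

For the six lines shown above, this gives sessions of 4510 ms and 6090 ms, which matches the "around 5 sec" pattern.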

So, my question is: Is there anything I can do to control this "refresh rate"? Is it caused by something I'm doing wrong in my RecognitionListener implementation (see below; note that I typically don't get any partial results between utterances)? Is there a PocketSphinx API call I don't know about that sets this refresh rate? Or is there something I could change in the PocketSphinx source to improve this behavior?

class VoiceListener implements RecognitionListener{

        private boolean isCommand = false;

        @Override
        public void onBeginningOfSpeech() {
            Log.d(TAG,"Beginning of Speech");
            // do nothing
        }

        @Override
        public void onEndOfSpeech() {
            Log.d(TAG,"End of Speech");
            // do nothing
        }

        @Override
        public void onPartialResult(Hypothesis arg0) {
            if( arg0 != null){
                Log.d(TAG, "Partial results list: " + arg0.getHypstr());

                isCommand = false;

                // handle recognition results for keywords
                for( String command : this.getCurrentCommands() ) {
                    if (arg0.getHypstr().contains(command)) {
                        this.onRecognition(arg0.getHypstr());
                        isCommand = true;
                        mRecognizer.stop();
                    }
                }

                // call stop, and let onResult() handle grammar results
                if( arg0.getHypstr().contains(Command.STOP_WORD))
                    mRecognizer.stop();

            }
        }

        @Override
        public void onResult(Hypothesis results) {

            String data;
            if( results == null ){
                data = null;
            }else{
                data = results.getHypstr();
            }

            Log.d(TAG,"Final results: " + data );

            // handle grammar recognition results
            if( !isCommand ){
                this.onRecognition(data);
            }
            return;

        }
}

Solution

  • There is no such thing as a "refresh rate". Recognition accuracy probably drops because there is background noise that is not being filtered out properly. You can study raw audio dumps to check whether silence is being counted as speech, and you can share those dumps to get help with this issue.

    Some things in your code are not very reasonable. If you are using keyword spotting only, there is no need to stop and restart the recognizer in onEndOfSpeech as you are doing now; you can simply skip that. In spotting mode you do not need to wait for the end of speech to get a result: you can use the partial result to invoke actions and then restart the recognizer.
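
The last point could be sketched roughly as follows. This is an illustration under stated assumptions, not the library's own code: KeywordRecognizer is a hypothetical stand-in for the two SpeechRecognizer calls used (cancel() and startListening(String)), and KeywordSpotter and onPartialText are made-up names. The idea is to act on the partial hypothesis as soon as it contains a keyword, then restart the search so the old hypothesis is discarded.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the subset of SpeechRecognizer used in this sketch. */
interface KeywordRecognizer {
    void cancel();
    void startListening(String searchName);
}

/** Reacts to keywords from partial results only; onEndOfSpeech is left empty. */
class KeywordSpotter {
    private final KeywordRecognizer recognizer;
    private final String searchName;
    private final List<String> keywords;
    private final List<String> detected = new ArrayList<>();

    KeywordSpotter(KeywordRecognizer recognizer, String searchName, List<String> keywords) {
        this.recognizer = recognizer;
        this.searchName = searchName;
        this.keywords = keywords;
    }

    /**
     * Call from onPartialResult with hypothesis.getHypstr(), or null when
     * there is no hypothesis. Returns true if a keyword was handled.
     */
    boolean onPartialText(String hypstr) {
        if (hypstr == null) {
            return false;
        }
        for (String keyword : keywords) {
            if (hypstr.contains(keyword)) {
                detected.add(keyword); // invoke the action for this keyword here
                // Restart the same search so the spotted keyword is not reported again.
                recognizer.cancel();
                recognizer.startListening(searchName);
                return true;
            }
        }
        return false;
    }

    /** Keywords handled so far, in order of detection. */
    List<String> detected() {
        return detected;
    }
}
```

In a real listener, onPartialResult would pass arg0 == null ? null : arg0.getHypstr() to onPartialText, and mRecognizer would be wrapped in or substituted for the stand-in interface.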