Tags: websphere, speech-to-text, ibm-speech-to-text

IBM Speech To Text - Get alternative with highest confidence and keyword found as first result


I'm using IBM Speech to Text. The results are OK, but I'm wondering why they are not sorted by highest confidence first. Is there a parameter that returns them sorted, so that I could just pick the first alternative? Ideally, a result would only be returned if the passed keyword is also found.

There is a max_alternatives parameter defaulting to 1, but even when specifying it explicitly, more than one alternative is returned.

I'm currently sorting the response manually, so I don't need a code sample for that.

JSON example:

   "result": {
        "result_index": 0,
        "results": [
            {
                "final": true,
                "alternatives": [
                    {
                        "transcript": "l\u00f6schen es tut echte betroffen ",
                        "confidence": 0.71
                    }
                ],
                "keywords_result": {}
            },
            {
                "final": true,
                "alternatives": [
                    {
                        "transcript": "sie sp\u00fcren dass eine \u00e4ra zu ende ",
                        "confidence": 0.91
                    }
                ],
                "keywords_result": {}
            },
            {
                "final": true,
                "alternatives": [
                    {
                        "transcript": "auto fahre eins zwei drei vier ",
                        "confidence": 0.95
                    }
                ],
                "keywords_result": {
                    "auto": [
                        {
                            "start_time": 6.31,
                            "end_time": 7.19,
                            "confidence": 0.99,
                            "normalized_text": "auto"
                        }
                    ]
                }
            }
        ]
    },
...
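For reference, the manual post-processing described above can be sketched as follows. This is a plain-Python sketch over a trimmed copy of the response shown above, not part of the IBM SDK; the `best_match` helper is hypothetical:

```python
import json

# Trimmed version of the "results" array from the service response above.
response = json.loads("""
{
  "result_index": 0,
  "results": [
    {"final": true,
     "alternatives": [{"transcript": "löschen es tut echte betroffen", "confidence": 0.71}],
     "keywords_result": {}},
    {"final": true,
     "alternatives": [{"transcript": "sie spüren dass eine ära zu ende", "confidence": 0.91}],
     "keywords_result": {}},
    {"final": true,
     "alternatives": [{"transcript": "auto fahre eins zwei drei vier", "confidence": 0.95}],
     "keywords_result": {"auto": [{"start_time": 6.31, "end_time": 7.19,
                                   "confidence": 0.99, "normalized_text": "auto"}]}}
  ]
}
""")

def best_match(response, keyword):
    """Return the highest-confidence alternative among results
    whose keywords_result contains the given keyword, or None."""
    candidates = [
        alt
        for res in response["results"]
        if keyword in res.get("keywords_result", {})
        for alt in res["alternatives"]
    ]
    return max(candidates, key=lambda a: a["confidence"], default=None)

print(best_match(response, "auto"))
```

This filters out the phrases without a keyword hit first, so the 0.91-confidence phrase is never considered even though it scores higher than some matches.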

Solution

  • The issue was end_of_phrase_silence_time. When a silence period longer than the default 0.8 seconds is detected, the speech is split into an additional phrase. So what I was seeing was not a different recognition result, but a separate phrase that occurred earlier in the audio recording. See the end_of_phrase_silence_time parameter.
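end_of_phrase_silence_time is passed as a query parameter on the /v1/recognize endpoint. A minimal sketch of building such a request URL with the Python standard library — the service host and model are placeholders for your own instance, and the 1.5 s value is just an illustrative choice above the 0.8 s default:

```python
from urllib.parse import urlencode

# Placeholder instance URL; substitute your own service endpoint.
base_url = "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/recognize"

params = {
    "model": "de-DE_BroadbandModel",     # placeholder model
    "keywords": "auto",
    "keywords_threshold": 0.5,
    "max_alternatives": 1,
    # Raise the silence threshold (default 0.8 s) so short pauses
    # no longer split the audio into additional phrases.
    "end_of_phrase_silence_time": 1.5,
}

url = f"{base_url}?{urlencode(params)}"
print(url)
```

The audio itself would then be POSTed to this URL with the appropriate Content-Type and authentication headers.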