Tags: speech-to-text, google-speech-api, google-cloud-speech, opus, google-speech-to-text-api

What does Google's Speech-to-Text configuration look like for an .opus audio file?


I am passing a .opus audio file to Google's Speech-to-Text API for transcription, roughly as in the sketch below. I am using the following configuration:

  • encoding = enums.RecognitionConfig.AudioEncoding.OGG_OPUS
  • language_code = "en-US"
  • sample_rate_hertz = 16000
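
For reference, this is roughly how I build and send the request (the file name is just a placeholder, and I am on the pre-2.0 google-cloud-speech client, which is where the enums module comes from):

    from google.cloud import speech_v1
    from google.cloud.speech_v1 import enums

    client = speech_v1.SpeechClient()

    config = {
        "encoding": enums.RecognitionConfig.AudioEncoding.OGG_OPUS,
        "language_code": "en-US",
        "sample_rate_hertz": 16000,
    }
    with open("my_audio.opus", "rb") as f:  # placeholder file name
        audio = {"content": f.read()}

    response = client.recognize(config, audio)  # this call raises the error below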

I am getting the following error:

google.api_core.exceptions.GoogleAPICallError: None Unable to recognize speech, possible error in encoding or channel config. Please correct the config and retry the request.

I've tried other encodings like FLAC and LINEAR16 and get None as output.

Do .opus audio files require an additional configuration field, and what should the configuration look like?


Solution

  • After working through the documentation provided by Google and a couple of tries, I figured out the cause of the error. The OGG_OPUS encoding requires audio_channel_count to be set explicitly in the config. In my case the audio had 2 channels, so I had to declare that. Also, for multi-channel audio, enable_separate_recognition_per_channel needs to be set to True.

    The config that worked for me is:

    from google.cloud.speech_v1 import enums

    encoding = enums.RecognitionConfig.AudioEncoding.OGG_OPUS
    config = {
        "audio_channel_count": 2,  # my .opus file had two channels
        "enable_separate_recognition_per_channel": True,
        "language_code": "en-US",
        "sample_rate_hertz": 16000,
        "encoding": encoding,
    }
    

    It is very important to use the correct values for each parameter in the config.
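
    For completeness, here is roughly how I pass that config to the client and read the per-channel results (a sketch assuming the same pre-2.0 google-cloud-speech client; the file name is a placeholder):

    from google.cloud import speech_v1

    client = speech_v1.SpeechClient()

    with open("my_audio.opus", "rb") as f:  # placeholder file name
        audio = {"content": f.read()}

    response = client.recognize(config, audio)
    for result in response.results:
        # with enable_separate_recognition_per_channel=True, each result
        # carries the channel it was recognized on
        print(result.channel_tag, result.alternatives[0].transcript)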