Twilio Base64 Media Payload for Google Speech To Text API not Responding

I need to transcribe Twilio phone calls in real time using the Google Speech-to-Text API, and I've followed a few demo apps showing how to set this up. My application is in .NET Core 3.1 and I am using webhooks with a Twilio-defined callback method. When the media arrives from Twilio through the callback, it is passed as raw audio encoded in base64, as you can see here.

I've also referenced this demo on live transcribing and am trying to mimic its case statement in C#. Everything connects correctly, and the media and payload are passed into my app from Twilio just fine.

The audio string is then converted to a byte[] to pass to the Task that transcribes the audio:

    byte[] audioBytes = Convert.FromBase64String(info);
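
For context, here is a minimal sketch of how the base64 string might be pulled out of a Twilio media message before it is decoded. It assumes the JSON shape Twilio documents for media stream messages (an `event` field plus a base64 `media.payload`); the `TwilioMedia` class and `DecodePayload` method are hypothetical names and are not part of the original code.

```csharp
using System;
using System.Text.Json;

public static class TwilioMedia
{
    // Returns the decoded audio bytes for a "media" message,
    // or null for any other event (connected, start, stop, ...).
    public static byte[] DecodePayload(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;

        if (root.GetProperty("event").GetString() != "media")
        {
            return null;
        }

        // The payload is the raw audio chunk, base64 encoded.
        string payload = root.GetProperty("media").GetProperty("payload").GetString();
        return Convert.FromBase64String(payload);
    }
}
```

Returning null for non-media events keeps the caller's case statement simple: only messages that actually carry audio produce bytes to forward.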

I am following the examples from the Google docs, which stream either from a file or from an audio input (such as a microphone). My use case differs in that I already have the bytes for each chunk of audio. The examples I referenced can be seen here: Transcribing audio from streaming input

Below is my implementation of the latter, using the raw audio bytes instead. The Task below runs when the Twilio websocket connection receives the media event, and I pass the payload directly into it. From my console logging I can see that execution reaches the "Print Responses hit..." log line, but it never enters the while (await responseStream.MoveNextAsync()) block to log the transcript to the console. I do not get any errors back (none that break the application, at least). Is this even possible to do? I have also tried loading the bytes into a MemoryStream object and passing them in as the Google doc examples do.

    static async Task<object> StreamingRecognizeAsync(byte[] audioBytes)
    {
        var speech = SpeechClient.Create();
        var streamingCall = speech.StreamingRecognize();
        // Write the initial request with the config.
        await streamingCall.WriteAsync(
            new StreamingRecognizeRequest()
            {
                StreamingConfig = new StreamingRecognitionConfig()
                {
                    Config = new RecognitionConfig()
                    {
                        // Assumed value: Twilio media streams carry 8 kHz mu-law audio.
                        Encoding = RecognitionConfig.Types.AudioEncoding.Mulaw,
                        SampleRateHertz = 8000,
                        LanguageCode = "en",
                    },
                    InterimResults = true,
                    SingleUtterance = true
                }
            });
        // Print responses as they arrive.
        Task printResponses = Task.Run(async () =>
        {
            Console.WriteLine("Print Responses hit...");
            var responseStream = streamingCall.GetResponseStream();

            while (await responseStream.MoveNextAsync())
            {
                StreamingRecognizeResponse response = responseStream.Current;
                Console.WriteLine("Response stream moveNextAsync Hit...");
                foreach (StreamingRecognitionResult result in response.Results)
                {
                    foreach (SpeechRecognitionAlternative alternative in result.Alternatives)
                    {
                        Console.WriteLine("Google transcript " + alternative.Transcript);
                    }
                }
            }
        });
        //using (MemoryStream memStream = new MemoryStream(audioBytes))
        //{
        //    var buffer = new byte[32 * 1024];
        //    int bytesRead;
        //    while ((bytesRead = await memStream.ReadAsync(buffer, 0, buffer.Length)) > 0)
        //    {
        //        await streamingCall.WriteAsync(
        //            new StreamingRecognizeRequest()
        //            {
        //                AudioContent = Google.Protobuf.ByteString
        //                .CopyFrom(buffer, 0, bytesRead),
        //            });
        //    }
        //}

        // Send the audio chunk received from Twilio.
        await streamingCall.WriteAsync(
            new StreamingRecognizeRequest()
            {
                AudioContent = Google.Protobuf.ByteString
                    .CopyFrom(audioBytes),
            });
        await streamingCall.WriteCompleteAsync();
        await printResponses;
        return 0;
    }


  • After all this, I discovered that this code works fine; it just needs to be broken up and called from different events in the Twilio stream lifecycle. The config section needs to be placed in the connected event. The print-responses task needs to be placed in the media event. Then WriteCompleteAsync needs to be called in the stop event, when Twilio closes the websocket.

    One other important item to consider is the number of requests being sent to Google STT: make sure they stay within the quota, which seems to be (for now) 300 requests / minute.
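
The split described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact code from the working application: `TwilioStreamHandler` and its three delegates are hypothetical names standing in for the config write, the audio write, and `WriteCompleteAsync` on the streaming call, and the 1600-byte flush threshold (about 200 ms of 8 kHz mu-law audio at one byte per sample) is an arbitrary choice to cut down the number of writes. Reading the response stream is left out to keep the sketch short.

```csharp
using System;
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;

public sealed class TwilioStreamHandler
{
    private readonly Func<Task> _sendConfig;          // stands in for the config WriteAsync
    private readonly Func<byte[], Task> _sendAudio;   // stands in for the AudioContent WriteAsync
    private readonly Func<Task> _complete;            // stands in for WriteCompleteAsync
    private readonly MemoryStream _pending = new MemoryStream();
    private readonly int _flushBytes;

    public TwilioStreamHandler(Func<Task> sendConfig, Func<byte[], Task> sendAudio,
                               Func<Task> complete, int flushBytes = 1600)
    {
        _sendConfig = sendConfig;
        _sendAudio = sendAudio;
        _complete = complete;
        _flushBytes = flushBytes;
    }

    // Call once for every text message received on the Twilio websocket.
    public async Task HandleMessageAsync(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;

        switch (root.GetProperty("event").GetString())
        {
            case "connected":
                // The config request is written exactly once, before any audio.
                await _sendConfig();
                break;

            case "media":
                // Buffer small chunks and forward them in batches to limit request count.
                byte[] chunk = Convert.FromBase64String(
                    root.GetProperty("media").GetProperty("payload").GetString());
                _pending.Write(chunk, 0, chunk.Length);
                if (_pending.Length >= _flushBytes)
                {
                    await FlushAsync();
                }
                break;

            case "stop":
                // Send whatever audio is left, then close the request stream.
                await FlushAsync();
                await _complete();
                break;
        }
    }

    private async Task FlushAsync()
    {
        if (_pending.Length == 0)
        {
            return;
        }
        byte[] batch = _pending.ToArray();
        _pending.SetLength(0);
        await _sendAudio(batch);
    }
}
```

With the code from the question, the three delegates would wrap `streamingCall.WriteAsync(...)` with the `StreamingConfig` request, `streamingCall.WriteAsync(...)` with an `AudioContent` request, and `streamingCall.WriteCompleteAsync()`, so one streaming call lives for the whole phone call instead of being created per chunk.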