How to send a Media message to Twilio in a bidirectional stream that Twilio can play?

I will try to be as succinct as I can here, but I have done quite a lot before getting to the point of posting a question on SO...

TL;DR: I get audio from Google's TTS API. I have a bidirectional stream from Twilio, and I want to send my TTS audio over it in a Twilio Media message so that Twilio plays it for the caller. All I hear as a caller, however, is a loud, staticky screech. So my questions:

What am I missing? What have I not tried? How to troubleshoot this issue?
How can I test on my machine that the mulaw/8000 bytes I want to send to Twilio will be playable? Why does Twilio not seem to agree with ffmpeg and sox, or with Audacity, about what is acceptable audio?
Base64 encoding comes in several flavors: "standard," "standard-no-pad," "url-safe," "url-safe-no-pad," and probably others. Is there one that Twilio requires?

Some answers to likely questions:

Twilio specifies that the audio sent to it should be audio/x-mulaw-encoded at 8000Hz sample rate. Are you sure what you're sending meets that requirement? Yes. When making my TTS request to Google, I specify that I would like the audio mulaw-encoded at 8000Hz. Google sends audio back as base64-encoded bytes; when I decode the returned base64 string and save it to a file on disk, both ffprobe and soxi confirm that it is indeed mulaw-encoded and with a sample rate of 8000Hz.

When Google encodes its TTS audio as mulaw, it attaches a wav header to the result. Twilio says that media sent to it should not contain any headers. Are you sure you are only sending raw audio bytes to Twilio? Yes. When I get a result back from Google, I first decode the base64 string, then clip the first 44 bytes (the size of the wav header), and base64 encode only the remaining bytes to send to Twilio. I know that the bytes I have clipped are the right ones because I have written them to a file on disk, then imported them into Audacity as mulaw, 8000Hz raw audio data, and Audacity plays the audio just fine.

Are you base64-encoding your mulaw/8000 bytes correctly? I suppose this may be an unanswerable question (on my end), but I think so. If I write the base64 string (encoded with the "standard" engine) I send to Twilio to a file test.enc on disk, then run base64 test.enc -d > unk.dat, I can import unk.dat into Audacity as mulaw/8000 raw data and it plays. In my application, I have tried the gamut of common base64 engines: standard, standard-no-pad, url-safe, url-safe-no-pad. None of these produce good results.

Are you formatting your Media message to Twilio correctly? Yes. At least, the message I am sending looks like the example here.

Is the Media message you are sending to Twilio a WS Text message? Yes.

I remember seeing someone somewhere on the interwebs suggesting that it's a good idea to send a Mark message after a Media message to Twilio. Did you try this for giggles? Yes.

Some ugly Rust code:

    // Rust code that does and returns the equivalent of the steps at
    // https://cloud.google.com/text-to-speech/docs/create-audio-text-command-line#synthesize_audio_from_text
    let audio_config = AudioConfig {
        audio_encoding: Some(AudioConfigAudioEncoding::MULAW),
        sample_rate_hertz: Some(8_000),
        ..Default::default()
    };
    // Other things for Google TTS API...

    let payload = synthesize_response.audio_content.unwrap();
    println!("{payload}");
    // `payload` is the base64-encoded mulaw/8000 bytes plus a `wav` header

    // Base64-decode `payload`
    let mut enc = Cursor::new(payload);
    let mut decoder = read::DecoderReader::new(&mut enc, &engine::general_purpose::STANDARD);
    let mut body = Vec::new();
    decoder.read_to_end(&mut body).unwrap();
    // `body` is now raw u8's; if written to a file on disk, `ffprobe` and `soxi` recognize it as
    // mulaw/8000 audio; `play` can play it.

    // Trim `wav` header from Google's response
    let trimmed = &body[44..];
    // `trimmed` is headerless mulaw/8000 audio; if written to a file on disk, it can be
    // imported into Audacity as mulaw-encoded, 8000Hz audio and played.

    // base64-encode the trimmed raw audio
    let re_encoded: String = engine::general_purpose::STANDARD.encode(trimmed);
    // Construct a Media message to send to Twilio.
    let outbound_media_meta = OutboundMediaMeta {
        payload: re_encoded,
    };
    let outbound_media = TwilioOutbound::Media {
        media: outbound_media_meta,
        stream_sid: stream_sid.clone(),
    };
    let json = serde_json::to_string(&outbound_media).unwrap();
    println!("{json}");
    // We can verify that the json content is of the right format for a Media message to be
    // consumed by Twilio (not obvious here is that the `"event": "media"` tag is present).
    let message = Message::Text(json);

    sender.send(message).await.unwrap();

Solution

Twilio Technical Support Engineer here. From my experience working on Bi-directional Media Stream tickets, I've observed that the WAV headers that Google, Amazon, Microsoft, etc. send back are not 44 bytes but rather 58 bytes. It seems that the WAV standard allows for additional metadata, which increases the header size. If you clip only 44 bytes, the leftover header bytes are sent to Twilio as if they were audio.
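
Rather than hard-coding 44 or 58, one option is to walk the RIFF chunks and take whatever sits inside the `data` chunk. The following is only a minimal sketch using the standard library: `wav_data_range` is a hypothetical helper name, and it assumes the decoded response is a regular RIFF/WAVE container (4-byte chunk id, 4-byte little-endian size, payload padded to an even length). I have not verified it against Google's actual output.

```rust
use std::ops::Range;

/// Returns the byte range of the `data` chunk's payload inside a RIFF/WAVE
/// buffer, or `None` if the buffer is not a WAV or has no `data` chunk.
/// (Hypothetical helper, not part of any Twilio or Google SDK.)
fn wav_data_range(wav: &[u8]) -> Option<Range<usize>> {
    // RIFF header: "RIFF", 4-byte size, "WAVE".
    if wav.len() < 12 || &wav[0..4] != b"RIFF" || &wav[8..12] != b"WAVE" {
        return None;
    }
    let mut pos = 12;
    // Keep going while a full 8-byte chunk header is still available.
    while wav.len().saturating_sub(pos) >= 8 {
        let id = &wav[pos..pos + 4];
        let size = u32::from_le_bytes([
            wav[pos + 4],
            wav[pos + 5],
            wav[pos + 6],
            wav[pos + 7],
        ]) as usize;
        let start = pos + 8;
        if id == b"data" {
            // The declared size may be a placeholder, so clamp it to the
            // bytes that are actually present in the buffer.
            let end = start.saturating_add(size).min(wav.len());
            return Some(start..end);
        }
        // Skip this chunk's payload, plus one pad byte when the size is odd.
        pos = start.checked_add(size)?.checked_add(size & 1)?;
    }
    None
}
```

In the question's code this would replace `&body[44..]` with something like `&body[wav_data_range(&body).unwrap()]`. If the range starts at 58 rather than 44, that matches the header size described above.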

As a test, decode the base64 response from Google and save it as a WAV file on your machine. Note the size of the file. Then open the file in Audacity, export the audio with the header set to RAW (header-less), and save it as a second file. The difference between the two file sizes is the header size.
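
The size comparison can also be scripted. This is a rough sketch, not a definitive check: `header_len_from_files` is a hypothetical helper, and it assumes the raw export keeps the same encoding and sample count as the original and that the WAV has no extra chunks after the audio data.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Header size implied by the two files: bytes in the WAV minus bytes in the
/// header-less export. Returns `Ok(None)` if the raw file is the larger one,
/// which would mean the two files do not hold the same audio.
/// (Hypothetical helper for illustration.)
fn header_len_from_files(wav: &Path, raw: &Path) -> io::Result<Option<u64>> {
    let wav_len = fs::metadata(wav)?.len();
    let raw_len = fs::metadata(raw)?.len();
    Ok(wav_len.checked_sub(raw_len))
}
```

Calling it with the paths of the saved WAV and the Audacity export gives the number of bytes to clip before re-encoding the audio for Twilio.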

If you're still facing the same issue after removing 58 bytes, please enable Voice Trace on your account, make some new test calls and then raise a Support Ticket via the Console and we will be happy to check the Call SIDs.