
Stream audio back to Twilio via websocket connection


I'm trying out Twilio's Programmable Voice feature and have implemented basic audio stream processing by referring to this doc. I now want to stream audio back to Twilio over the same websocket and have Twilio play that audio to the caller.

Is there any way to achieve this?

This is what my TwiML Bin XML configuration looks like:

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Start>
        <Stream url="wss://XXXXXXX.in.ngrok.io/media" />
    </Start>
     <Dial>+91**********</Dial>
</Response>

I referred to Twilio's Bi-directional Media Streams documentation, but it doesn't specify the format or structure in which I need to send the audio bytes back to Twilio.

I also found this question, where the answer says that sending an audio stream back over the Twilio websocket is not possible.

Can I get some help understanding how to achieve this?


Solution

  • Very late to answer, but it's possible. Refer to this section: https://www.twilio.com/docs/voice/twiml/stream#websocket-messages-to-twilio

    Essentially, you use the same websocket to send data back across the connection. The data should be sent as JSON, in text mode.

    Here's an example JSON that was provided in the docs:

    {
      "event": "media",
      "streamSid": "MZ18ad3ab5a668481ce02b83e7395059f0",
      "media": {
        "payload": "a3242sadfasfa423242... (a base64 encoded string of 8000/mulaw)"
      }
    }
    

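    For reference, here is a minimal sketch of building that message yourself in Python (the helper name build_media_message is mine): the payload must be base64-encoded 8 kHz, 8-bit mu-law audio, and the whole message goes over the websocket as a single text frame.

    import base64
    import json


    def build_media_message(stream_sid: str, mulaw_bytes: bytes) -> str:
        """Build the outbound media message Twilio expects on the stream.

        mulaw_bytes must already be 8 kHz, 8-bit mu-law encoded audio;
        Twilio plays it back to the caller as-is.
        """
        return json.dumps(
            {
                "event": "media",
                "streamSid": stream_sid,
                "media": {"payload": base64.b64encode(mulaw_bytes).decode("ascii")},
            }
        )
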
    I'm personally using Amazon Polly for TTS. Here is an example of how to use Polly:

    import audioop
    import base64
    import logging
    from contextlib import AsyncExitStack

    from aiobotocore.session import AioSession
    from botocore.exceptions import BotoCoreError, ClientError
    from fastapi import HTTPException

    logger = logging.getLogger(__name__)


    class Manager:
        def __init__(self):
            self._exit_stack = AsyncExitStack()
            self._s3_client = None

        async def __aenter__(self):
            session = AioSession()
            self._s3_client = await self._exit_stack.enter_async_context(
                session.create_client("s3")
            )
            return self

        async def __aexit__(self, exc_type, exc_val, exc_tb):
            await self._exit_stack.__aexit__(exc_type, exc_val, exc_tb)


    async def create_client(service: str, session: AioSession, exit_stack: AsyncExitStack):
        client = await exit_stack.enter_async_context(session.create_client(service))
        return client


    # Presence of an SSML tag tells us to ask Polly for SSML synthesis
    WORD = "<speak>"


    async def synthesize_speech(text: str, voice_id: str = "Matthew"):
        session = AioSession()

        async with AsyncExitStack() as exit_stack:
            polly = await create_client("polly", session, exit_stack)
            try:
                response = await polly.synthesize_speech(
                    Text=text,
                    TextType="ssml" if WORD in text else "text",
                    OutputFormat="pcm",
                    VoiceId=voice_id,
                    SampleRate="8000",
                )
            except (BotoCoreError, ClientError) as error:
                logger.error(error)
                raise HTTPException(500, "Failed to synthesize speech")
            else:
                # Polly returns 16-bit linear PCM; convert it to 8-bit mu-law,
                # which is what Twilio expects in the media payload
                pcm_audio = await response["AudioStream"].read()
                audio = audioop.lin2ulaw(pcm_audio, 2)
                base64_audio = base64.b64encode(audio).decode("utf-8")
                return base64_audio

    And here's an example of how to send the websocket data back in FastAPI:

    from fastapi import FastAPI, WebSocket, WebSocketDisconnect

    app = FastAPI()


    @app.websocket("/stream")
    async def websocket(ws: WebSocket):
        await ws.accept()
        stream_sid = None
        try:
            while True:
                packet = await ws.receive_json()
                if packet["event"] == "start":
                    # Save the stream SID for later use
                    # I would go as far as saving most of the start message
                    stream_sid = packet["streamSid"]
                    continue
                if packet["event"] != "media" or stream_sid is None:
                    # Ignore "connected", "stop", "mark", etc. in this minimal example
                    continue

                # Send audio back:
                await ws.send_json(
                    {
                        "event": "media",
                        "streamSid": stream_sid,
                        "media": {
                            "payload": await synthesize_speech("Hello world!")
                        }
                    }
                )
                # If you want to send multiple audio messages,
                # you should send a mark message. You'll receive
                # a mark event back, after which you can send the next audio.

        except WebSocketDisconnect:
            pass
    

    I recommend sending a mark message right after the media message. This lets you know when your audio has finished playing, so you can batch your audio requests to Amazon Polly and send them sequentially.
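
    For completeness, a minimal sketch of that mark flow, assuming the same ws, stream_sid, and synthesize_speech from the examples above (the helper name send_audio_with_mark and the mark name "chunk-1" are mine):

    async def send_audio_with_mark(ws: WebSocket, stream_sid: str, text: str, mark_name: str):
        # Queue the audio for playback...
        await ws.send_json(
            {
                "event": "media",
                "streamSid": stream_sid,
                "media": {"payload": await synthesize_speech(text)},
            }
        )
        # ...then send a mark right behind it. Twilio echoes a "mark" event
        # with the same name once everything queued before it has been played.
        await ws.send_json(
            {
                "event": "mark",
                "streamSid": stream_sid,
                "mark": {"name": mark_name},
            }
        )


    # In the receive loop, wait for the echoed mark before sending the next chunk:
    #     if packet["event"] == "mark" and packet["mark"]["name"] == "chunk-1":
    #         await send_audio_with_mark(ws, stream_sid, next_text, "chunk-2")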