
Play AWS-Polly-generated voices client-side in my web application


I am generating long-form content with AWS Polly using the AWS SDK for JavaScript. My content is over 3000 characters long, so I am using the long-form engine and saving all generated files in an S3 bucket (as required). Each audio file is under 4 MB (around 10 minutes of audio at most).

I can generate the Polly files and save them, and I can see the contents of the S3 bucket, but I have had no luck retrieving or playing these files. What am I missing?

My project is a React/Node/TypeScript web application. Right now I am running it locally in a Docker container while I develop this feature.

I should note that I am new to AWS, so there may be basics that I'm missing.

I would like to either stream the content from Polly as it is being generated, or at least stream it from S3 once generation has completed.

First I tried using SynthesizeSpeechCommand. Its output (SynthesizeSpeechCommandOutput) contains an AudioStream, which offers a method called transformToWebStream(), but neither the AudioStream nor the object returned from transformToWebStream() behaved the way I would expect a readable stream to behave (based on my experience with Node file handling and streaming).

  const playNarration = async () => {
    const stream: SynthesizeSpeechCommandOutput | undefined = await getAudiostream();
    if (stream) {
      console.log(stream);
      const webStream: ReadableStream | undefined = stream.AudioStream?.transformToWebStream();
      console.log(webStream);
      if (webStream) {
        webStream.on('data', (chunk: any) => { // THIS ERRORS, SAYS 'ON' IS NOT A FUNCTION
          console.log(chunk);
        });
      }
    }
  };

I also tried using a StartSpeechSynthesisTaskCommand, grabbing the OutputUri from the SynthesisTask that is returned and sending that to an audio player (https://www.npmjs.com/package/react-h5-audio-player).

  static async getAudiostream(article: IArticle): Promise<string | undefined> {
    if (!NarrationProvider.pollyClient) {
      return undefined;
    }
    const bodyString = documentToPlainTextString(article.body);
    const narrationParams = {
      Engine: Engine.LONG_FORM,
      LanguageCode: LanguageCode.en_US,
      OutputFormat: OutputFormat.MP3,
      Text: bodyString,
      TextType: TextType.TEXT,
      VoiceId: VoiceId.Danielle,
      OutputS3BucketName: NarrationProvider.s3Bucket,
      OutputS3KeyPrefix: article.slug,
    };
    // Starts an asynchronous synthesis task; Polly writes the MP3 to the S3 bucket.
    const command = new StartSpeechSynthesisTaskCommand(narrationParams);
    // Any error from send() rejects this promise, so no .catch() is needed here.
    const output = await NarrationProvider.pollyClient.send(command);
    // This OutputUri is what gets passed to the audio player.
    return output.SynthesisTask?.OutputUri;
  }

Just for the hell of it, I tried manually generating a presigned URL for an S3 file and sending that to the audio player, and that didn't work either.

I can't possibly be the only person who wants to put AI voices in their application, but I am not finding any useful or recent answers here on Stack Overflow.


Solution

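  • The ReadableStream returned by transformToWebStream() is a web (WHATWG) stream, the kind implemented in browsers, and its API differs from the Node.js streams you may be used to. In particular, there is no on method; you read it through a reader obtained from getReader(). You can see the documentation here: https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream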
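    To consume such a stream chunk by chunk (roughly what .on('data') did with Node streams), you take a reader and loop over read() until it reports done. A minimal sketch, assuming a browser or Node.js 18+ where ReadableStream is a global; readStream and its onChunk callback are hypothetical names, not part of the AWS SDK:

```typescript
// Sketch: read a WHATWG ReadableStream (such as the one returned by
// AudioStream.transformToWebStream()) chunk by chunk. Web streams have no
// .on('data') method; chunks are pulled from a reader instead.
async function readStream(
  stream: ReadableStream<Uint8Array>,
  onChunk: (chunk: Uint8Array) => void, // hypothetical callback, similar to a 'data' listener
): Promise<number> {
  const reader = stream.getReader();
  let totalBytes = 0;
  try {
    for (;;) {
      const result = await reader.read();
      if (result.done) break; // stream closed: no more chunks
      onChunk(result.value);
      totalBytes += result.value.byteLength;
    }
  } finally {
    reader.releaseLock(); // release the reader's lock on the stream
  }
  return totalBytes;
}
```

    For example, readStream(pollyRes.AudioStream.transformToWebStream(), (chunk) => console.log(chunk.byteLength)) logs the size of each chunk as it arrives. Arbitrary network chunks of an MP3 are not guaranteed to be independently decodable, so for playback it is usually simpler to collect the whole response first, as in the AudioContext approach.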

    Normally when using Polly this way, you can decode the buffer you get and play it back right away with an AudioContext.

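    import {
      Engine,
      LanguageCode,
      OutputFormat,
      PollyClient,
      SynthesizeSpeechCommand,
      VoiceId,
    } from '@aws-sdk/client-polly';

    // Example region; use whatever region and credentials your app already configures
    const pollyClient = new PollyClient({ region: 'us-east-1' });

    // Generate speech. SynthesizeSpeech accepts at most 3,000 billed characters
    // of input per request, so bodyString must stay within that limit.
    const pollyRes = await pollyClient.send(
      new SynthesizeSpeechCommand({
        Engine: Engine.LONG_FORM,
        LanguageCode: LanguageCode.en_US,
        OutputFormat: OutputFormat.MP3,
        Text: bodyString,
        VoiceId: VoiceId.Danielle
      })
    );
    if (!pollyRes.AudioStream) {
      throw new Error('Polly returned no audio stream');
    }

    // Play speech (browsers may keep an AudioContext suspended until a user gesture)
    const audioContext = new AudioContext();
    const pollyBufferSourceNode = audioContext.createBufferSource();

    // Copy into a fresh, exact-size ArrayBuffer: a Uint8Array's .buffer can be larger than the view
    const bytes = await pollyRes.AudioStream.transformToByteArray();
    pollyBufferSourceNode.buffer = await audioContext.decodeAudioData(new Uint8Array(bytes).buffer);

    pollyBufferSourceNode.connect(audioContext.destination);

    pollyBufferSourceNode.start();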
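    The article text in the question is longer than 3,000 characters, which exceeds what a single SynthesizeSpeech request accepts. One workaround (not part of the original answer) is to split the text at sentence boundaries and synthesize the pieces one after another. A minimal sketch of the splitting step; splitForPolly is a hypothetical helper, and it counts JavaScript string length (UTF-16 code units), which only approximates Polly's billed-character count:

```typescript
// Hypothetical helper: split plain text into pieces of at most maxChars
// characters, preferring sentence boundaries, so each piece fits in one
// SynthesizeSpeech request. Length is measured in UTF-16 code units.
function splitForPolly(text: string, maxChars = 3000): string[] {
  // Each match is a run of non-terminators followed by ., ! or ? and any
  // trailing whitespace, or a final unterminated fragment. Together the
  // matches cover the whole input, so no characters are dropped.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let current = '';
  const flush = () => {
    if (current.trim()) chunks.push(current.trim());
    current = '';
  };
  for (const sentence of sentences) {
    if (current.length + sentence.length <= maxChars) {
      current += sentence; // still fits: keep packing sentences together
      continue;
    }
    flush();
    // A single over-long sentence is hard-split every maxChars characters.
    let rest = sentence;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  flush();
  return chunks;
}
```

    Each piece can then be sent as Text in its own SynthesizeSpeechCommand, and the decoded buffers played in order. Splitting at sentence ends keeps the pauses between requests where a listener would expect them.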
    

    As for the S3 output, I haven't used that method, but I would fully expect you could generate a signed GET URL for the object and do something like:

    const a = new Audio(url);
    a.play(); // Must be done on user click or some other interactive event
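    With StartSpeechSynthesisTaskCommand, the OutputUri comes back as soon as the task is queued, but the MP3 only exists once the task's status becomes completed, and a private bucket will not serve that URI without a signature. Below is a sketch of those two steps, written without AWS dependencies so the logic stands alone. waitForTaskCompletion, parseS3OutputUri, and the fetchStatus callback are hypothetical names, and the URI parsing assumes the path-style S3 URL Polly typically returns (with a fallback for virtual-hosted URLs). The trailing comment shows how the pieces might be wired, on the server, to GetSpeechSynthesisTaskCommand (@aws-sdk/client-polly) and getSignedUrl (@aws-sdk/s3-request-presigner):

```typescript
// Status values Polly reports for a speech synthesis task.
type PollyTaskStatus = 'scheduled' | 'inProgress' | 'completed' | 'failed';

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll until the task finishes. fetchStatus is a hypothetical callback that
// returns the task's current status (see the wiring comment below).
async function waitForTaskCompletion(
  fetchStatus: () => Promise<PollyTaskStatus | undefined>,
  intervalMs = 2000,
  maxAttempts = 150,
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await fetchStatus();
    if (status === 'completed') return;
    if (status === 'failed') throw new Error('Polly synthesis task failed');
    await sleep(intervalMs); // still scheduled or in progress: wait and retry
  }
  throw new Error('Timed out waiting for Polly synthesis task');
}

// Split an S3 object URL into bucket and key so the object can be presigned.
// Handles path-style (https://s3.<region>.amazonaws.com/<bucket>/<key>) and
// virtual-hosted style (https://<bucket>.s3.<region>.amazonaws.com/<key>).
function parseS3OutputUri(outputUri: string): { bucket: string; key: string } {
  const url = new URL(outputUri);
  const path = decodeURIComponent(url.pathname.slice(1)); // drop leading "/"
  const hostBucket = url.hostname.match(/^(.+)\.s3[.-]/);
  if (hostBucket) return { bucket: hostBucket[1], key: path };
  const slash = path.indexOf('/');
  if (slash < 0) throw new Error(`Unexpected S3 URI: ${outputUri}`);
  return { bucket: path.slice(0, slash), key: path.slice(slash + 1) };
}

// Wiring sketch (server side, AWS SDK v3; not executed here):
//   const task = (await pollyClient.send(startCommand)).SynthesisTask;
//   await waitForTaskCompletion(async () => (await pollyClient.send(
//     new GetSpeechSynthesisTaskCommand({ TaskId: task!.TaskId! }))).SynthesisTask?.TaskStatus);
//   const { bucket, key } = parseS3OutputUri(task!.OutputUri!);
//   const url = await getSignedUrl(s3Client,
//     new GetObjectCommand({ Bucket: bucket, Key: key }), { expiresIn: 3600 });
```

    Signing on the server keeps AWS credentials out of the browser; the client then only receives the short-lived URL to hand to new Audio(url) or the audio player component.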