Tags: javascript, reactjs, safari, text-to-speech, arraybuffer

Audio not playing in Safari for macOS/iOS


I'm using Google's Text-to-Speech API on the backend and sending the result to the frontend as an ArrayBuffer. It then gets converted to a URL that is played with audio.play(). This works in Chrome on mobile, Windows, and macOS, but no luck in Safari.

I've seen a few threads similar to this one, and tried a few of the answers with no luck.

I've tried creating the audioPlayer when the component is created and just changing its src in playVoice.

playVoice is just called from a button's onClick.

The frontend functions look like:

  const playVoice = (text: string) => {
    getSpeech(text, sourceLanguage, "NEUTRAL").then((res) => {
      const audioPlayer = new Audio();
      audioPlayer.pause();
      audioPlayer.currentTime = 0;
      audioPlayer.src = convertAudio([res.data]);
      audioPlayer.play();
    });
  };

with getSpeech being an axios get request:

export const getSpeech = async (
  text: string,
  languageCode: string,
  voice: VoiceTypes
) => {
  return await axios({
    method: "GET",
    url: "/api/speech/",
    responseType: "blob",
    params: {
      text,
      languageCode,
      voice,
    },
  });
};

and convertAudio looks like

export const convertAudio = (buffer: ArrayBuffer[]): string => {
  return URL.createObjectURL(new Blob(buffer));
};

My backend looks something like

const textToSpeech = require("@google-cloud/text-to-speech");
const asyncHandler = require("express-async-handler");
const stream = require("stream");
const client = new textToSpeech.TextToSpeechClient(process.env.SERVICE_ACCOUNT);

const getVoice = asyncHandler(async (req, res) => {
  const { text, languageCode, voice } = req.query;

  const request = {
    input: { text },
    voice: { languageCode, ssmlGender: voice },
    audioConfig: { audioEncoding: "MP3" },
  };

  res.set({
    "Content-Type": "audio/mpeg",
    "Transfer-Encoding": "chunked",
  });

  const [response] = await client.synthesizeSpeech(request);
  const bufferStream = new stream.PassThrough();
  bufferStream.end(Buffer.from(response.audioContent));
  bufferStream.pipe(res);
});

Solution

  • A few notes about the code you've shown:

    1. The HTMLAudioElement constructor accepts an optional string URL parameter, whose use is documented as follows:

    If a URL is specified, the browser begins to asynchronously load the media resource before returning the new object.

    This is advantageous because it lets you use a streaming audio resource: playback can begin as soon as the browser determines that enough data has been downloaded for the playback timeline to progress without interruption while the progressive download continues, all without needing to download the entire audio file in advance.

    The code you've shown downloads the entire audio file before beginning playback. You can change this by constructing a source URL instead of using axios to download the file, then responding to the canplaythrough event to begin playback earlier.

    From the event's documentation page:

    The canplaythrough event is fired when the user agent can play the media, and estimates that enough data has been loaded to play the media up to its end without having to stop for further buffering of content.
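    As a minimal sketch of reacting to that event (the function name is illustrative, and the element is typed structurally so the same function can also be exercised with a mock outside a browser):

```typescript
// Structural type covering only the slice of HTMLAudioElement we need,
// so the sketch can be driven by a mock outside a browser.
interface PlayableAudio {
  addEventListener(
    type: string,
    listener: () => void,
    options?: { once: boolean },
  ): void;
  play(): Promise<void>;
}

// Resolve once playback has actually started after canplaythrough fires.
function playWhenReady<T extends PlayableAudio>(audio: T): Promise<T> {
  return new Promise((resolve, reject) => {
    audio.addEventListener(
      'canplaythrough',
      () => audio.play().then(() => resolve(audio)).catch(reject),
      // canplaythrough can fire again after a seek or rebuffer;
      // { once: true } keeps the handler from running twice.
      { once: true },
    );
  });
}
```

    In a browser you would call it as `playWhenReady(new Audio(url.href))`.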

    You just need to create a function which will construct the appropriate URL — here's an example:

    // You don't show this type, so here's an example:
    type VoiceType = 'NEUTRAL';
    
    function createSpeechUrl (
      text: string,
      languageCode: string,
      voice: VoiceType,
    ): URL {
      const url = new URL('/api/speech/', window.location.href);
      url.searchParams.set('text', text);
      url.searchParams.set('languageCode', languageCode);
      url.searchParams.set('voice', voice);
      return url;
    }
    
    2. Using the technique described above also has the benefit of not creating a new object URL for every speech instance. Each time you create an object URL from an audio blob, you duplicate the memory required for the audio data. You don't show that you ever clean up any of the object URLs: if you aren't revoking them after playback concludes, that is a memory leak in your application. By not using object URLs, you avoid the problem entirely.
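    If you do keep object URLs somewhere, here is a hedged sketch of the cleanup (the function name is illustrative, and the audio element is typed structurally so the function can be exercised with a mock outside a browser):

```typescript
// Minimal structural slice of HTMLAudioElement needed here.
interface AudioLike {
  src: string;
  addEventListener(
    type: string,
    listener: () => void,
    options?: { once: boolean },
  ): void;
  play(): Promise<void>;
}

// Play a blob through an object URL and revoke the URL when playback
// ends (or fails), so the duplicated audio memory can be reclaimed.
function playBlobAndRevoke(blob: Blob, audio: AudioLike): Promise<void> {
  const objectUrl = URL.createObjectURL(blob);
  audio.src = objectUrl;
  return new Promise((resolve, reject) => {
    audio.addEventListener(
      'ended',
      () => {
        URL.revokeObjectURL(objectUrl); // release the blob URL's memory
        resolve();
      },
      { once: true },
    );
    audio.play().catch((err) => {
      URL.revokeObjectURL(objectUrl); // clean up even when playback fails
      reject(err);
    });
  });
}
```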

    You tagged your question with reactjs, so I assume you're using React, even though you don't show any React code. Below I've prepared a code snippet demonstrating the technique described above with a simple button rendered by React. I tested it in Chrome and Safari (the browsers you named in the question), and everything works as expected in those environments.

    <div id="root"></div>
    <script src="https://cdn.jsdelivr.net/npm/[email protected]/umd/react.development.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/[email protected]/umd/react-dom.development.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/@babel/[email protected]/babel.min.js"></script>
    <script>Babel.registerPreset('tsx', {presets: [[Babel.availablePresets['typescript'], {allExtensions: true, isTSX: true}]]});</script>
    <style>button { font-family: sans-serif; font-size: 1rem; padding: 0.5rem; }</style>
    <script type="text/babel" data-type="module" data-presets="tsx,react">
    
    // You don't show this type, so here's an example:
    type VoiceType = 'NEUTRAL';
    
    function createSpeechUrl (
      text: string,
      languageCode: string,
      voice: VoiceType,
    ): URL {
      const url = new URL('/api/speech/', window.location.href);
      url.searchParams.set('text', text);
      url.searchParams.set('languageCode', languageCode);
      url.searchParams.set('voice', voice);
      return url;
    }
    
    // Since the Stack Overflow code snippet doesn't have access to your server,
    // here is a substitute function pointing to a public, static mp3 URL:
    function createSpeechUrlForStackOverflow (...params: any[]): URL {
      // A random doorbell audio sample I found on GitHub
      const url = new URL('https://raw.githubusercontent.com/prof3ssorSt3v3/media-sample-files/65dbf140bdf0e66e8373fccff580ac0ba043f9c4/doorbell.mp3');
      return url;
    }
    
    function playVoice (text: string): Promise<HTMLAudioElement> {
      const languageCode = 'en-US';
      const voice = 'NEUTRAL';
    
      // const url = createSpeechUrl(text, languageCode, voice);
      // Substitute for this SO code snippet:
      const url = createSpeechUrlForStackOverflow(text, languageCode, voice);
    
      // Instantiate the audio element with the source URL
      // so that it can stream the audio data as early as possible
      // (without waiting for the entire "file" to buffer)
      const audio = new Audio(url.href);
    
      // Return a promise with the result of attempting playback
      // after enough streaming data has been downloaded
    return new Promise<HTMLAudioElement>((resolve, reject) => {
        audio.addEventListener(
          'canplaythrough',
          () => audio.play().then(() => resolve(audio)).catch(reject),
          // canplaythrough can fire again after a seek; react only once
          { once: true },
        );
        // Reject if the resource fails to load, so the promise always settles
        audio.addEventListener('error', () => reject(audio.error), { once: true });
      });
    }
    
    function App (): React.ReactElement {
      return (<button onClick={() => playVoice('ding-dong')}>Play "ding-dong"</button>);
    }
    
    const reactRoot = ReactDOM.createRoot(document.getElementById('root')!);
    
    reactRoot.render(
      <React.StrictMode>
        <App />
      </React.StrictMode>
    );
    
    </script>

    Code in TypeScript Playground