Tags: google-cloud-platform, speech-to-text, google-cloud-speech

Do time offsets work during streaming audio transcriptions with Google Speech-To-Text?


Time offsets for streaming audio transcriptions through Google Speech-To-Text are not working for me. My configuration looks like this:

const request = {
  config: {
    model: 'phoneCall',
    maxAlternatives: 1, // for real-time, we always parse a single alternative
    enableWordTimeOffsets: true,
    encoding: 'MULAW',
    sampleRateHertz: 8000,
    languageCode: 'en-GB'
  },
  interimResults: true
};
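(For completeness: the client used in the next snippet is the standard SpeechClient from the official @google-cloud/speech package. A minimal setup sketch, since the question doesn't show it:)

// Minimal client setup (not shown in the question): the standard
// SpeechClient from the official @google-cloud/speech package.
const speech = require('@google-cloud/speech');
const client = new speech.SpeechClient();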

Once we have a handle on a WebSocket connection, we set up our callback for transcriptions:

recognizeStream = client
  .streamingRecognize(request)
  .on('error', console.error)
  .on('data', data => {
    console.log(data.results[0].alternatives[0].transcript);
    // Debug: dump every property of the top alternative.
    for (const v in data.results[0].alternatives[0]) {
      console.log(`v=${data.results[0].alternatives[0][v]}`);
    }
    data.results[0].alternatives[0].words.forEach(wordInfo => {
      // NOTE: If you have a time offset exceeding 2^32 seconds, use the
      // wordInfo.{x}Time.seconds.high to calculate seconds.
      // Combine seconds and nanos into fractional seconds (nanos are 1e-9 s).
      const startSecs =
        Number(wordInfo.startTime.seconds) +
        wordInfo.startTime.nanos / 1e9;
      const endSecs =
        Number(wordInfo.endTime.seconds) +
        wordInfo.endTime.nanos / 1e9;
      console.log(`Word: ${wordInfo.word}`);
      console.log(`\t ${startSecs} secs - ${endSecs} secs`);
    });
  });
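(As an aside, the seconds/nanos arithmetic above can be factored into a small helper. This is just a sketch, assuming the startTime/endTime fields follow the protobuf Duration shape { seconds, nanos }, where seconds may arrive as a Long-like { low, high } pair for very large offsets, per the NOTE in the sample:)

// Sketch only: convert a protobuf-Duration-like { seconds, nanos } object
// to decimal seconds. `seconds` may be a Long-like { low, high } pair for
// offsets exceeding 2^32 seconds (see the NOTE above).
function durationToSecs(d) {
  const secs =
    d.seconds && typeof d.seconds === 'object'
      ? (d.seconds.low >>> 0) + d.seconds.high * 2 ** 32 // unsigned low word
      : Number(d.seconds);
  return secs + (d.nanos || 0) / 1e9;
}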

Then when we get audio chunks, we do this:

recognizeStream.write(msg.media.payload);

where msg is the object parsed from an incoming JSON WebSocket message:

const msg = JSON.parse(message);
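(For context, the surrounding handler looks roughly like the sketch below. The message shape — msg.media.payload carrying base64-encoded MULAW audio — is an assumption based on the config above, modelled on a Twilio-style media stream; adjust the event and field names to your provider's schema.)

// Sketch of the assumed surrounding handler (field names hypothetical):
ws.on('message', message => {
  const msg = JSON.parse(message);
  if (msg.event === 'media') {
    // The payload is assumed to be base64-encoded MULAW audio; decode it
    // to raw bytes before writing to the recognize stream.
    recognizeStream.write(Buffer.from(msg.media.payload, 'base64'));
  }
});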

Unfortunately, the array data.results[0].alternatives[0].words is always empty, even though the real-time transcriptions are working as expected.

Has anyone verified that time offsets in fact work for streaming audio transcriptions with Google Speech-To-Text?

Incidentally, here is the Git repo for the Node.js API for Google Speech-To-Text.


Solution

  • The preponderance of evidence suggests that time offsets for words transcribed through Google Speech-To-Text are returned only when the is_final flag is true.

    Put another way, timestamped word boundaries for real-time transcriptions appear to be available only once a result is final, not on interim results (see the sketch below).

    I know I am not the only API consumer asking for this feature. I can't imagine it would be hard to implement, and I suspect the fix would not break the current API.
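As a minimal sketch of how to work with that behaviour (using the camel-cased isFinal field exposed by the Node.js client), the data callback can keep streaming interim transcripts while reading word offsets only from final results:

recognizeStream = client
  .streamingRecognize(request)
  .on('error', console.error)
  .on('data', data => {
    const result = data.results[0];
    if (!result) return;
    // Interim results carry a transcript but an empty `words` array.
    console.log(result.alternatives[0].transcript);
    // Word-level time offsets appear to be populated only on final results.
    if (result.isFinal) {
      result.alternatives[0].words.forEach(wordInfo => {
        const startSecs =
          Number(wordInfo.startTime.seconds) +
          wordInfo.startTime.nanos / 1e9;
        const endSecs =
          Number(wordInfo.endTime.seconds) +
          wordInfo.endTime.nanos / 1e9;
        console.log(`Word: ${wordInfo.word}`);
        console.log(`\t ${startSecs} secs - ${endSecs} secs`);
      });
    }
  });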