javascript, node.js, async-await, speech-recognition, google-cloud-speech

How to end Google Speech-to-Text streamingRecognize gracefully and get back the pending text results?


I'd like to be able to end a Google speech-to-text stream (created with streamingRecognize), and get back the pending SR (speech recognition) results.

In a nutshell, the relevant Node.js code:

// create SR stream
const stream = speechClient.streamingRecognize(request);

// observe data event
const dataPromise = new Promise(resolve => stream.on('data', resolve));

// observe error event
const errorPromise = new Promise((resolve, reject) => stream.on('error', reject));

// observe finish event
const finishPromise = new Promise(resolve => stream.on('finish', resolve));

// send the audio
stream.write(audioChunk);

// for testing purposes only, give the SR stream 2 seconds to absorb the audio
await new Promise(resolve => setTimeout(resolve, 2000));

// end the SR stream gracefully, by observing the completion callback
const endPromise = util.promisify(callback => stream.end(callback))();

// a 5-second test timeout
const timeoutPromise = new Promise(resolve => setTimeout(resolve, 5000));

// finishPromise wins the race here
await Promise.race([
  dataPromise, errorPromise, finishPromise, endPromise, timeoutPromise]);

// endPromise wins the race here
await Promise.race([
  dataPromise, errorPromise, endPromise, timeoutPromise]);

// timeoutPromise wins the race here
await Promise.race([dataPromise, errorPromise, timeoutPromise]);

// I don't see any data or error events, dataPromise and errorPromise don't get settled

What I experience is that the SR stream ends successfully, but I don't get any data events or error events. Neither dataPromise nor errorPromise gets resolved or rejected.

How can I signal the end of my audio, close the SR stream and still get the pending SR results?

I need to stick with the streamingRecognize API because the audio I'm streaming is real-time, even though it may stop suddenly.

To clarify, it works as long as I keep streaming the audio: I do receive the real-time SR results. However, when I send the final audio chunk and end the stream as above, I don't get the final results I'd otherwise expect.

To get the final results, I actually have to keep streaming silence for several more seconds, which may increase the STT bill. I feel like there must be a better way to get them.
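In code, that silence-padding looks roughly like this (the chunk size and cadence here are my own guesses, not values prescribed by the API):

// 100ms of 16-bit/16KHz mono PCM silence (16000 samples/s * 2 bytes * 0.1s)
const silence = Buffer.alloc(3200);

// keep feeding silence until the pending results arrive, then end the stream
const keepAlive = setInterval(() => stream.write(silence), 100);
stream.once('data', () => {
  clearInterval(keepAlive);
  stream.end();
});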

Updated: so it appears the only proper time to end a streamingRecognize stream is upon a data event where StreamingRecognitionResult.is_final is true. It also appears we're expected to keep streaming audio until a data event is fired, to get any result at all, final or interim.
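In code, that pattern boils down to something like this sketch (isFinal is how the Node.js client surfaces StreamingRecognitionResult.is_final):

// keep calling stream.write(audioChunk) with live audio elsewhere;
// only end the stream once a final result has been observed
stream.on('data', data => {
  const result = data.results[0];
  if (result && result.isFinal) {
    stream.end();
  }
});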

This looks like a bug to me, so I'm filing an issue.

Updated: it now seems to have been confirmed as a bug. Until it's fixed, I'm looking for a potential workaround.

Updated: for future reference, here is a list of the current and previously tracked issues involving streamingRecognize.

I'd expect this to be a common problem for those who use streamingRecognize; I'm surprised it hasn't been reported before. I'm submitting it as a bug to issuetracker.google.com as well.


Solution

  • My bad: unsurprisingly, this turned out to be an obscure race condition in my code.

    I've put together a self-contained sample that works as expected (gist). It helped me track down the issue. Hopefully, it may help others and my future self:

    // A simple streamingRecognize workflow,
    // tested with Node v15.0.1, by @noseratio
    
    import fs from 'fs';
    import path from "path";
    import url from 'url'; 
    import util from "util";
    import timers from 'timers/promises';
    import speech from '@google-cloud/speech';
    
    export {}
    
    // need a 16-bit, 16KHz raw PCM audio file
    const filename = path.join(path.dirname(url.fileURLToPath(import.meta.url)), "sample.raw");
    const encoding = 'LINEAR16';
    const sampleRateHertz = 16000;
    const languageCode = 'en-US';
    
    const request = {
      config: {
        encoding: encoding,
        sampleRateHertz: sampleRateHertz,
        languageCode: languageCode,
      },
      interimResults: false // If you want interim results, set this to true
    };
    
    // init SpeechClient
    const client = new speech.v1p1beta1.SpeechClient();
    await client.initialize();
    
    // Stream the audio to the Google Cloud Speech API
    const stream = client.streamingRecognize(request);
    
    // log all data
    stream.on('data', data => {
      const result = data.results[0];
      console.log(`SR results, final: ${result.isFinal}, text: ${result.alternatives[0].transcript}`);
    });
    
    // log all errors
    stream.on('error', error => {
      console.warn(`SR error: ${error.message}`);
    });
    
    // observe data event
    const dataPromise = new Promise(resolve => stream.once('data', resolve));
    
    // observe error event
    const errorPromise = new Promise((resolve, reject) => stream.once('error', reject));
    
    // observe finish event
    const finishPromise = new Promise(resolve => stream.once('finish', resolve));
    
    // observe close event
    const closePromise = new Promise(resolve => stream.once('close', resolve));
    
    // we could just pipe it: 
    // fs.createReadStream(filename).pipe(stream);
    // but we want to simulate the web socket data
    
    // read RAW audio as Buffer
    const data = await fs.promises.readFile(filename, null);
    
    // simulate multiple audio chunks
    console.log("Writting...");
    const chunkSize = 4096;
    for (let i = 0; i < data.length; i += chunkSize) {
      stream.write(data.slice(i, i + chunkSize));
      await timers.setTimeout(50);
    }
    console.log("Done writing.");
    
    console.log("Before ending...");
    await util.promisify(c => stream.end(c))();
    console.log("After ending.");
    
    // race for events
    await Promise.race([
      errorPromise.catch(() => console.log("error")), 
      dataPromise.then(() => console.log("data")),
      closePromise.then(() => console.log("close")),
      finishPromise.then(() => console.log("finish"))
    ]);
    
    console.log("Destroying...");
    stream.destroy();
    console.log("Final timeout...");
    await timers.setTimeout(1000);
    console.log("Exiting.");
    

    The output:

    Writing...
    Done writing.
    Before ending...
    SR results, final: true, text:  this is a test I'm testing voice recognition This Is the End
    After ending.
    data
    finish
    Destroying...
    Final timeout...
    close
    Exiting.
    

    To test it, a 16-bit/16KHz raw PCM audio file is required. An arbitrary WAV file won't work as-is, because it contains a header with metadata.
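    If raw PCM isn't readily available, one quick-and-dirty option is to strip the header off a simple PCM WAV file before streaming it. This is only a sketch: it assumes a hypothetical sample.wav with the canonical 44-byte RIFF header, and real-world files may carry extra chunks that need proper parsing:

    // assumes the canonical 44-byte header of a plain PCM WAV file;
    // real files may contain extra chunks, so a real WAV parser is safer
    const wav = await fs.promises.readFile('sample.wav');
    const rawPcm = wav.subarray(44); // skip the RIFF/fmt/data header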