javascript, node.js, async-await, speech-recognition, google-cloud-speech

How to end Google Speech-to-Text streamingRecognize gracefully and get back the pending text results?


I'd like to be able to end a Google speech-to-text stream (created with streamingRecognize), and get back the pending SR (speech recognition) results.

In a nutshell, the relevant Node.js code:

// create SR stream
const stream = speechClient.streamingRecognize(request);

// observe data event
const dataPromise = new Promise(resolve => stream.on('data', resolve));

// observe error event
const errorPromise = new Promise((resolve, reject) => stream.on('error', reject));

// observe finish event
const finishPromise = new Promise(resolve => stream.on('finish', resolve));

// send the audio
stream.write(audioChunk);

// for testing purposes only, give the SR stream 2 seconds to absorb the audio
await new Promise(resolve => setTimeout(resolve, 2000));

// end the SR stream gracefully, by observing the completion callback
const endPromise = util.promisify(callback => stream.end(callback))();

// a 5-second test timeout
const timeoutPromise = new Promise(resolve => setTimeout(resolve, 5000));

// finishPromise wins the race here
await Promise.race([
  dataPromise, errorPromise, finishPromise, endPromise, timeoutPromise]);

// endPromise wins the race here
await Promise.race([
  dataPromise, errorPromise, endPromise, timeoutPromise]);

// timeoutPromise wins the race here
await Promise.race([dataPromise, errorPromise, timeoutPromise]);

// I don't see any data or error events, dataPromise and errorPromise don't get settled

What I experience is that the SR stream ends successfully, but I don't get any data events or error events. Neither dataPromise nor errorPromise gets resolved or rejected.

How can I signal the end of my audio, close the SR stream and still get the pending SR results?

I need to stick with the streamingRecognize API because the audio I'm streaming is real-time, even though it may stop suddenly.

To clarify, it works as long as I keep streaming the audio: I do receive the real-time SR results. However, when I send the final audio chunk and end the stream as above, I don't get the final results I'd otherwise expect.

To get the final results, I actually have to keep streaming silence for several more seconds, which may increase the STT bill. I feel like there must be a better way to get them.
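In code, that silence-padding looks roughly like this (the chunk size and cadence here are my own guesses, not values prescribed by the API):

// 100ms of 16-bit/16KHz mono PCM silence (16000 samples/s * 2 bytes * 0.1s)
const silence = Buffer.alloc(3200);

// keep feeding silence until the pending results arrive, then end the stream
const keepAlive = setInterval(() => stream.write(silence), 100);
stream.once('data', () => {
  clearInterval(keepAlive);
  stream.end();
});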

Updated: so it appears the only proper time to end a streamingRecognize stream is upon a data event where StreamingRecognitionResult.is_final is true. It also appears we're expected to keep streaming audio until a data event is fired, to get any result at all, final or interim.
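In code, that pattern boils down to something like this sketch (isFinal is how the Node.js client surfaces StreamingRecognitionResult.is_final):

// keep calling stream.write(audioChunk) with live audio elsewhere;
// only end the stream once a final result has been observed
stream.on('data', data => {
  const result = data.results[0];
  if (result && result.isFinal) {
    stream.end();
  }
});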

This looks like a bug to me, so I'm filing an issue.

Updated: it now seems to have been confirmed as a bug. Until it's fixed, I'm looking for a potential workaround.

Updated: for future reference, here is a list of the current and previously tracked issues involving streamingRecognize.

I'd expect this to be a common problem for those who use streamingRecognize; I'm surprised it hasn't been reported before. I'm submitting it as a bug to issuetracker.google.com as well.


Solution

  • My bad: unsurprisingly, this turned out to be an obscure race condition in my code.

    I've put together a self-contained sample that works as expected (gist). It helped me track down the issue. Hopefully, it may help others and my future self:

    // A simple streamingRecognize workflow,
    // tested with Node v15.0.1, by @noseratio
    
    import fs from 'fs';
    import path from "path";
    import url from 'url'; 
    import util from "util";
    import timers from 'timers/promises';
    import speech from '@google-cloud/speech';
    
    export {}
    
    // need a 16-bit, 16KHz raw PCM audio file
    const filename = path.join(path.dirname(url.fileURLToPath(import.meta.url)), "sample.raw");
    const encoding = 'LINEAR16';
    const sampleRateHertz = 16000;
    const languageCode = 'en-US';
    
    const request = {
      config: {
        encoding: encoding,
        sampleRateHertz: sampleRateHertz,
        languageCode: languageCode,
      },
      interimResults: false // If you want interim results, set this to true
    };
    
    // init SpeechClient
    const client = new speech.v1p1beta1.SpeechClient();
    await client.initialize();
    
    // Stream the audio to the Google Cloud Speech API
    const stream = client.streamingRecognize(request);
    
    // log all data
    stream.on('data', data => {
      const result = data.results[0];
      console.log(`SR results, final: ${result.isFinal}, text: ${result.alternatives[0].transcript}`);
    });
    
    // log all errors
    stream.on('error', error => {
      console.warn(`SR error: ${error.message}`);
    });
    
    // observe data event
    const dataPromise = new Promise(resolve => stream.once('data', resolve));
    
    // observe error event
    const errorPromise = new Promise((resolve, reject) => stream.once('error', reject));
    
    // observe finish event
    const finishPromise = new Promise(resolve => stream.once('finish', resolve));
    
    // observe close event
    const closePromise = new Promise(resolve => stream.once('close', resolve));
    
    // we could just pipe it: 
    // fs.createReadStream(filename).pipe(stream);
    // but we want to simulate the web socket data
    
    // read RAW audio as Buffer
    const data = await fs.promises.readFile(filename, null);
    
    // simulate multiple audio chunks
    console.log("Writting...");
    const chunkSize = 4096;
    for (let i = 0; i < data.length; i += chunkSize) {
      stream.write(data.slice(i, i + chunkSize));
      await timers.setTimeout(50);
    }
    console.log("Done writing.");
    
    console.log("Before ending...");
    await util.promisify(c => stream.end(c))();
    console.log("After ending.");
    
    // race for events
    await Promise.race([
      errorPromise.catch(() => console.log("error")), 
      dataPromise.then(() => console.log("data")),
      closePromise.then(() => console.log("close")),
      finishPromise.then(() => console.log("finish"))
    ]);
    
    console.log("Destroying...");
    stream.destroy();
    console.log("Final timeout...");
    await timers.setTimeout(1000);
    console.log("Exiting.");
    

    The output:

    Writing...
    Done writing.
    Before ending...
    SR results, final: true, text:  this is a test I'm testing voice recognition This Is the End
    After ending.
    data
    finish
    Destroying...
    Final timeout...
    close
    Exiting.
    

    To test it, a 16-bit/16KHz raw PCM audio file is required. An arbitrary WAV file won't work as-is, because it contains a header with metadata.
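    If raw PCM isn't readily available, one quick-and-dirty option is to strip the header off a simple PCM WAV file before streaming it. This is only a sketch: it assumes a hypothetical sample.wav with the canonical 44-byte RIFF header, and real-world files may carry extra chunks that need proper parsing:

    // assumes the canonical 44-byte header of a plain PCM WAV file;
    // real files may contain extra chunks, so a real WAV parser is safer
    const wav = await fs.promises.readFile('sample.wav');
    const rawPcm = wav.subarray(44); // skip the RIFF/fmt/data header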