Google Speech to Text optimal values


I'm trying to optimise the values I pass to Speech-to-Text from a Node.js application, and I'd like to know whether my current settings follow best practice.

I understand that Speech-to-Text recommends LINEAR16 encoding with a 16,000 Hz sample rate, but that isn't possible for VoIP audio, which is sent at 8,000 Hz, and Twilio currently only offers MULAW encoding.

What I want to find out is whether the values I'm using for "model", "use_enhanced" and "confidence" are good.

if (this.newStreamRequired()) {
  if (this.stream) {
    this.stream.destroy();
  }

  var request = {
    config: {
      encoding: "MULAW",
      sampleRateHertz: 8000,
      languageCode: "en-US",
      model: 'phone_call',
      use_enhanced: true,
      confidence: 1.0
    },
    single_utterance: false,
    interimResults: false,
    is_final: true
    
  };

  this.streamCreatedAt = new Date();
  this.stream = speech
    .streamingRecognize(request)
    .on("error", console.error)
    .on("data", (data) => {
      const result = data.results[0];
      if (result === undefined || result.alternatives[0] === undefined) {
        return;
      }
      this.emit('transcription', result.alternatives[0].transcript);
    });
}

Solution

  • In general, it is difficult to say whether your options are the best possible. The most reliable approach is to study the alternatives, run a couple of tests against audio you have already transcribed by hand, and keep the parameters that yield the best results.
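
One way to make "run a couple of tests" concrete is to score each configuration's transcript against a hand-made reference transcript using word error rate (WER). The sketch below is only an illustration: `wordErrorRate` is a hypothetical helper name, not part of any Google library, and the normalisation it applies (lowercasing, stripping punctuation) is an assumption you may want to adjust.

```javascript
// Word error rate: word-level edit distance divided by the reference length.
// Hypothetical helper for comparing transcripts; not part of any Speech API.
function wordErrorRate(reference, hypothesis) {
  // Assumed normalisation: lowercase, drop punctuation, split on whitespace.
  const normalize = (s) =>
    s.toLowerCase().replace(/[^a-z0-9' ]+/g, " ").trim().split(/\s+/).filter(Boolean);

  const ref = normalize(reference);
  const hyp = normalize(hypothesis);

  // With an empty reference the ratio is undefined; report 0 or 1 by convention.
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // prev[j] holds the edit distance between the first i-1 reference words
  // and the first j hypothesis words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

Transcribing the same recordings once with the enhanced model and once without, then comparing the two scores, gives a number to decide on rather than an impression. A lower score is better.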

    In any case, let’s examine your particular case:

    • Model: The best model for 8000Hz audio is phone_call, as stated in the Speech-to-Text documentation. The other models are better suited to 16000Hz audio.
    • Use_enhanced: The only options are true and false, so it should be easy to test both. On paper, an enhanced model should yield better results, especially for the phone_call model.
    • Confidence: This field is typically a value returned in the response; I don’t think it can be set in the request config. Note that a streaming config is built on top of the default recognition config, so the same applies to streaming requests.
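
Since confidence comes back in the response, a threshold would be applied in the "data" handler instead of in the request. Here is a minimal sketch, assuming the response shape the Node.js client delivers (`results[0].isFinal` and `alternatives[0].confidence`); `pickTranscript` and its `minConfidence` parameter are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: returns the top transcript of a final result when its
// confidence meets minConfidence, otherwise null.
function pickTranscript(data, minConfidence = 0) {
  const result = data && data.results && data.results[0];
  // Ignore empty responses and interim (non-final) results.
  if (!result || !result.isFinal) return null;

  const best = result.alternatives && result.alternatives[0];
  if (!best) return null;

  // Treat a missing confidence as 0 so it only passes a threshold of 0.
  const confidence = typeof best.confidence === "number" ? best.confidence : 0;
  return confidence >= minConfidence ? best.transcript : null;
}
```

In the handler from the question this could be used as `const text = pickTranscript(data, 0.6); if (text !== null) this.emit('transcription', text);`, where 0.6 is an arbitrary example threshold, not a recommended value.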

    All in all, I think the parameters in your request have appropriate values, except for confidence, which probably does not belong in the request at all.
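
Putting the points above together, the request could look roughly like the sketch below. It assumes the `@google-cloud/speech` Node.js client, which as far as I know expects camelCase field names (`useEnhanced`, `singleUtterance`) in place of the snake_case ones in the question. `is_final` is left out because it is a field of the streaming result, not of the request. `buildStreamingRequest` is a hypothetical wrapper used only to keep the example self-contained.

```javascript
// Hypothetical helper that builds the streaming request for 8 kHz MULAW
// phone audio. It only constructs the object; it does not call the API.
function buildStreamingRequest() {
  return {
    config: {
      encoding: "MULAW",       // what Twilio sends, per the question
      sampleRateHertz: 8000,   // VoIP sample rate
      languageCode: "en-US",
      model: "phone_call",     // model suited to 8000 Hz telephone audio
      useEnhanced: true,       // camelCase spelling assumed for the Node.js client
      // no "confidence" here: it is reported in the response
    },
    singleUtterance: false,
    interimResults: false,
  };
}
```

The result would then be passed as `speech.streamingRecognize(buildStreamingRequest())`, with the rest of the stream handling unchanged.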