I'm trying to optimise the Speech-to-Text request parameters in a Node.js application, and I'd like to determine whether my current settings follow best practice.
I understand that Speech-to-Text recommends LINEAR16 encoding with a 16,000 Hz sample rate, but this isn't possible for VoIP audio, which is sent at 8,000 Hz, and Twilio currently only offers MULAW encoding.
What I want to find out is whether the values being used for "model", "use_enhanced" and "confidence" are good.
if (this.newStreamRequired()) {
  if (this.stream) {
    this.stream.destroy();
  }
  var request = {
    config: {
      encoding: "MULAW",
      sampleRateHertz: 8000,
      languageCode: "en-US",
      model: 'phone_call',
      use_enhanced: true,
      confidence: 1.0
    },
    single_utterance: false,
    interimResults: false,
    is_final: true
  };
  this.streamCreatedAt = new Date();
  this.stream = speech
    .streamingRecognize(request)
    .on("error", console.error)
    .on("data", (data) => {
      const result = data.results[0];
      if (result === undefined || result.alternatives[0] === undefined) {
        return;
      }
      this.emit('transcription', result.alternatives[0].transcript);
    });
}
In general, it is difficult to assess whether your options are indeed the best. The best approach is to study the alternatives, run a few tests, and stick with the parameters that yield the best results.
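One way to make such tests comparable is to score each configuration's transcript against a hand-made reference transcript. The sketch below computes a word error rate (word-level edit distance divided by the reference length). The function name `wordErrorRate` and the sample sentences are illustrative assumptions, not part of the Speech-to-Text API.

```javascript
// Minimal sketch: word error rate (WER) between a reference transcript and
// the transcript returned for a given configuration. Lower is better.
// `wordErrorRate` is a hypothetical helper, not a library function.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);

  // Word-level Levenshtein distance, keeping only the previous row.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }

  if (ref.length === 0) {
    return hyp.length === 0 ? 0 : 1;
  }
  return prev[hyp.length] / ref.length;
}

// Illustrative comparison with made-up transcripts.
const reference = 'call me back tomorrow';
console.log(wordErrorRate(reference, 'call me back tomorrow'));
console.log(wordErrorRate(reference, 'call be back'));
```

Running the same recordings through each candidate configuration and comparing the scores gives a concrete basis for choosing between them.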
In any case, let’s examine your particular case:
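model: phone_call is the appropriate choice for 8,000 Hz telephony audio, as stated in the documentation. The other alternatives are better fits for 16,000 Hz audio.
use_enhanced: an enhanced version is available for the phone call model (see the documentation), so enabling it is reasonable here.
confidence: this is not a request parameter. Speech-to-Text reports a confidence score for each alternative in the response, so setting it in the config does not fit the request.
All in all, I think the parameters in your request have the proper values, except for the confidence value, which does not belong among the request parameters.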
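As a rough sketch of the points above, the request could be built with the camelCase field names used by the `@google-cloud/speech` Node.js client, leaving `confidence` and `is_final` out of the request and reading them from the response instead. The helper names `buildStreamingRequest` and `pickTranscript` and the `minConfidence` threshold are assumptions made for illustration, not part of the client library.

```javascript
// Sketch only: request shape for streamingRecognize with 8 kHz MULAW audio.
// `buildStreamingRequest` is a hypothetical helper.
function buildStreamingRequest() {
  return {
    config: {
      encoding: 'MULAW',
      sampleRateHertz: 8000,
      languageCode: 'en-US',
      model: 'phone_call',
      useEnhanced: true, // camelCase in the Node.js client
    },
    singleUtterance: false,
    interimResults: false,
    // No `confidence` or `is_final` here: both are reported in the response.
  };
}

// Hypothetical helper: return the top transcript if its confidence meets a
// caller-chosen threshold, otherwise null. A confidence of 0 can mean
// "not set", so the default threshold of 0 accepts every result.
function pickTranscript(data, minConfidence = 0) {
  const result = data.results && data.results[0];
  const best = result && result.alternatives && result.alternatives[0];
  if (!best) {
    return null;
  }
  const confidence = typeof best.confidence === 'number' ? best.confidence : 0;
  return confidence >= minConfidence ? best.transcript : null;
}

// Usage with a hand-written response object (no network call is made).
const sample = {
  results: [{ isFinal: true, alternatives: [{ transcript: 'hello', confidence: 0.9 }] }],
};
console.log(buildStreamingRequest().config.model);
console.log(pickTranscript(sample, 0.5));
```

In the original handler, `pickTranscript(data)` could replace the manual `undefined` checks before emitting the `transcription` event, with the request passed to `speech.streamingRecognize(buildStreamingRequest())`.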