I am using the Google Speech-to-Text API in Node.js. I'm doing the following:
googleSpeechClient.streamingRecognize({
  config: {
    encoding: 'LINEAR16',
    sampleRateHertz: 16000,
    languageCode: 'en-US',
    enableAutomaticPunctuation: true,
    metadata: {
      interactionType: 'PHONE_CALL',
      microphoneDistance: 'NEARFIELD',
      originalMediaType: 'VIDEO',
      recordingDeviceType: 'PC'
    },
    model: 'video',
    useEnhanced: true,
    enableWordConfidence: true,
    enableWordTimeOffsets: true,
    diarizationConfig: {
      enableSpeakerDiarization: true,
      minSpeakerCount: 1,
      maxSpeakerCount: 6
    }
  },
  interimResults: true,
  singleUtterance: false
})
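The responses are read from the returned duplex stream's data events, roughly like this (simplified; the code that pipes the raw audio into the stream is omitted here):

const recognizeStream = googleSpeechClient.streamingRecognize({ /* request object shown above */ });
recognizeStream.on('error', console.error);
recognizeStream.on('data', response => {
  // each response carries zero or more StreamingRecognitionResults
  response.results.forEach(result => console.log(result));
});
// LINEAR16 audio chunks at 16 kHz are then written/piped into recognizeStream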
When I give it a short clip from The Wolf of Wall Street, the responses I get look like this for the interim results:
{
  results: [
    {
      alternatives: [{
        words: [],
        transcript: 'Hey John, thank you for your vote of confidence and welcome to the',
        confidence: 0
      }],
      isFinal: false,
      stability: 0.8999999761581421,
      resultEndTime: [Object],
      channelTag: 0,
      languageCode: 'en-us'
    },
    {
      alternatives: [{ words: [], transcript: ' investor Center.', confidence: 0 }],
      isFinal: false,
      stability: 0.009999999776482582,
      resultEndTime: [Object],
      channelTag: 0,
      languageCode: 'en-us'
    }
  ],
  error: null,
  speechEventType: 'SPEECH_EVENT_UNSPECIFIED'
}
and like this for the results marked as final:
{
  words: [
    { startTime: [Object], endTime: [Object], word: 'Hey', confidence: 0.550264298915863, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'John,', confidence: 0.7241439819335938, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'thank', confidence: 0.9128385782241821, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'you', confidence: 0.7003968358039856, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'for', confidence: 0.7170425057411194, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'your', confidence: 0.9128385782241821, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'vote', confidence: 0.7738808989524841, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'of', confidence: 0.7003968358039856, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'confidence', confidence: 0.5876403450965881, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'and', confidence: 0.9128385782241821, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'welcome', confidence: 0.9128385782241821, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'to', confidence: 0.7243974208831787, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'the', confidence: 0.657508909702301, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'investors', confidence: 0.6374689936637878, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'Center.', confidence: 0.7192383408546448, speakerTag: 0 },
    { startTime: [Object], endTime: [Object], word: 'Bye-bye.', confidence: 0.6980124115943909, speakerTag: 0 }
  ],
  transcript: 'Hey John, thank you for your vote of confidence and welcome to the investors Center. Bye-bye.',
  confidence: 0.7401091456413269
}
Is there any way to get the word confidences for the interim results? Thanks for any help or insights!
Unfortunately, there is no way to get word confidences on interim results. Confidence is only populated when is_final=true. See the documentation reference:
confidence - The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result or, of a streaming result where is_final=true. This field is not guaranteed to be accurate and users should not rely on it to be always provided. The default of 0.0 is a sentinel value indicating confidence was not set.
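In practice this means word-level confidence can only be read once isFinal is true. Here is a minimal sketch of a data handler along those lines (assuming recognizeStream is the duplex stream returned by streamingRecognize):

recognizeStream.on('data', response => {
  for (const result of response.results) {
    const alt = result.alternatives[0];
    if (!alt) continue;

    if (result.isFinal) {
      // words[] (with per-word confidence and speakerTag) is populated only here
      for (const w of alt.words) {
        console.log(`${w.word}\t${w.confidence.toFixed(3)}\tspeaker ${w.speakerTag}`);
      }
    } else {
      // interim results: only transcript and stability are meaningful;
      // alt.confidence stays 0 and alt.words stays empty
      console.log(`[interim, stability=${result.stability}] ${alt.transcript}`);
    }
  }
});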
However, you could file a Speech-to-Text API feature request asking for word confidence to be output in interim results as well.