Time offsets for streaming audio transcriptions through Google Speech-To-Text are not working for me. My configuration looks like this:
const request = {
  config: {
    model: 'phone_call',
    maxAlternatives: 1, // for real-time, we always parse a single alternative.
    enableWordTimeOffsets: true,
    encoding: "MULAW",
    sampleRateHertz: 8000,
    languageCode: "en-GB"
  },
  interimResults: true
};
Once we get a handle on a WebSockets connection, we then set up our callback for transcriptions:
recognizeStream = client
  .streamingRecognize(request)
  .on("error", console.error)
  .on("data", data => {
    // We only request a single alternative, so read the words from it directly.
    data.results[0].alternatives[0].words.forEach(wordInfo => {
      // NOTE: If you have a time offset exceeding 2^32 seconds, use the
      // wordInfo.{x}Time.seconds.high to calculate seconds.
      const startSecs =
        `${wordInfo.startTime.seconds}` +
        '.' +
        wordInfo.startTime.nanos / 100000000;
      const endSecs =
        `${wordInfo.endTime.seconds}` +
        '.' +
        wordInfo.endTime.nanos / 100000000;
      console.log(`Word: ${wordInfo.word}`);
      console.log(`\t ${startSecs} secs - ${endSecs} secs`);
    });
  });
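The string concatenation above only yields a sensible value when nanos is an exact multiple of 100 ms. A minimal sketch of a more general conversion follows, assuming each offset is a Duration-like object with `seconds` (a number, a string, or a Long-like object) and `nanos` fields. The helper names `durationToSeconds` and `formatWord` are hypothetical and not part of the client library.

```javascript
// Hypothetical helper: convert a Duration-like { seconds, nanos } object
// into a floating-point number of seconds.
function durationToSeconds(duration) {
  if (!duration) return 0;
  // ASSUMPTION: seconds may be a number, a decimal string, or a Long-like
  // object whose toString() returns a decimal string.
  const seconds = Number(String(duration.seconds ?? 0));
  const nanos = Number(duration.nanos ?? 0);
  return seconds + nanos / 1e9;
}

// Hypothetical helper: format one word with its start and end offsets.
function formatWord(wordInfo) {
  const start = durationToSeconds(wordInfo.startTime).toFixed(3);
  const end = durationToSeconds(wordInfo.endTime).toFixed(3);
  return `${wordInfo.word}: ${start}s - ${end}s`;
}
```

Inside the data callback this could be used as `console.log(formatWord(wordInfo))` in place of the two manual concatenations.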
Then when we get audio chunks, we do this:
where msg is a JSON object parsed from a WebSockets message:
const msg = JSON.parse(message);
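For illustration only, here is a rough sketch of how a parsed message might be forwarded to the recognition stream. It assumes the audio arrives base64-encoded in a field called `msg.media.payload`; that field name and the helper `forwardAudioChunk` are assumptions, not something taken from the code above.

```javascript
// Hypothetical helper: parse one WebSockets message and forward its audio
// to a writable stream such as recognizeStream.
function forwardAudioChunk(message, recognizeStream) {
  const msg = JSON.parse(message);
  // ASSUMPTION: audio is carried as a base64 string in msg.media.payload.
  if (!msg.media || !msg.media.payload) {
    return false; // not an audio message, nothing to forward
  }
  const audio = Buffer.from(msg.media.payload, 'base64');
  recognizeStream.write(audio);
  return true;
}
```

Returning a boolean makes it easy to count or log the messages that carried no audio.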
Unfortunately, the array data.results[0].alternatives[0].words is always empty, even though the real-time transcriptions are working as expected.
Has anyone verified that time offsets in fact work for streaming audio transcriptions with Google Speech-To-Text?
Incidentally, here is the Git repository for the Node.js client for Google Speech-To-Text.
The preponderance of evidence suggests that time offsets for words transcribed through Google Speech-To-Text are returned only when the is_final flag is True.
Said another way, timestamped word boundaries for real-time transcriptions appear to be available only at the end of the transcription.
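Given that behavior, one way to avoid reading an empty words array is to skip interim results and only look at word offsets on final ones. This is a minimal sketch, assuming the Node.js client exposes the flag as `isFinal` on each streaming result; the helper name `extractFinalWords` is hypothetical.

```javascript
// Hypothetical helper: return the word list only for final results,
// and an empty array for interim results or malformed responses.
function extractFinalWords(data) {
  const result = data && data.results && data.results[0];
  // ASSUMPTION: the streaming result carries the flag as `isFinal`.
  if (!result || !result.isFinal) {
    return [];
  }
  const alternative = result.alternatives && result.alternatives[0];
  return (alternative && alternative.words) || [];
}
```

In the data callback, `extractFinalWords(data).forEach(...)` would then run only once the result is marked final, while interim transcripts can still be displayed separately.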
I know I am not the only API consumer out there asking for this feature. I can't imagine this is hard to do, and I suspect the fix would not break the current API.