audio google-cloud-platform speech-to-text google-speech-api

Google Speech API streaming audio exceeding 1 minute

I would like to extract a person's utterances from a stream of telephone audio. The phone audio is routed to my server, which then creates a streaming recognition request. How can I tell whether a word belongs to a complete utterance or to an utterance that is still being transcribed? Should I compare timestamps between words? Will the API continue to return interim results even if there is no speech in the streaming phone audio for a certain amount of time? How can I exceed the 1-minute limit on streaming audio?

Solution

About your first 3 questions:

You don’t need to compare timestamps between words. You can tell whether a word is part of a complete utterance (a final result) by checking the is_final flag in the StreamingRecognitionResult. If the flag is true, the result is a completed transcription; otherwise, it is an interim result. More on this here.
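The is_final check described above could look roughly like the sketch below. To keep it runnable without credentials, the `Alternative`, `Result`, and `Response` classes are hypothetical stand-ins that only mirror the fields discussed here (`results`, `alternatives`, `transcript`, `is_final`); with the real client you would iterate over the streaming responses instead. The function names are also my own.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple


# Hypothetical stand-ins for the API's response messages (assumption:
# only the fields used below are modeled).
@dataclass
class Alternative:
    transcript: str


@dataclass
class Result:
    alternatives: List[Alternative]
    is_final: bool = False


@dataclass
class Response:
    results: List[Result] = field(default_factory=list)


def split_utterances(responses: Iterable[Response]) -> Iterator[Tuple[str, str]]:
    """Yield ("final", text) for completed utterances, ("interim", text) otherwise."""
    for response in responses:
        for result in response.results:
            if not result.alternatives:
                continue  # nothing transcribed in this result
            text = result.alternatives[0].transcript
            yield ("final" if result.is_final else "interim", text)


def final_utterances(responses: Iterable[Response]) -> List[str]:
    """Keep only the completed utterances, discarding interim hypotheses."""
    return [text for kind, text in split_utterances(responses) if kind == "final"]
```

Interim texts may be revised by later responses, so only the entries marked "final" should be treated as settled words of an utterance.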

Once you get the final results, no interim results should be generated until new utterances are streamed.

Regarding your last question, you can’t exceed the 1-minute limit. You need to send multiple requests instead.
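One way to split the audio across multiple requests is to group the incoming chunks into segments that each stay under the limit, then open a new streaming request per segment. The sketch below shows only the grouping step and makes several assumptions: chunks have a fixed, known duration, the 55-second default is an arbitrary safety margin under the 1-minute limit, and `segment_audio` is a hypothetical helper, not part of the API.

```python
from typing import Iterable, Iterator, List


def segment_audio(
    chunks: Iterable[bytes],
    chunk_seconds: float,
    limit_seconds: float = 55.0,  # assumed margin below the 1-minute limit
) -> Iterator[List[bytes]]:
    """Group fixed-duration audio chunks into segments no longer than limit_seconds."""
    if chunk_seconds <= 0 or chunk_seconds > limit_seconds:
        raise ValueError("chunk_seconds must be positive and fit within limit_seconds")
    segment: List[bytes] = []
    elapsed = 0.0
    for chunk in chunks:
        # Close the current segment before it would exceed the limit.
        if segment and elapsed + chunk_seconds > limit_seconds:
            yield segment
            segment, elapsed = [], 0.0
        segment.append(chunk)
        elapsed += chunk_seconds
    if segment:
        yield segment  # trailing partial segment
```

Each yielded segment would be sent as its own streaming request. Note that an utterance in progress at a segment boundary will be cut in two, so in practice you may prefer to restart the request right after a final result arrives rather than at a fixed time.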