
Google Cloud transcription API


I would like to calculate the duration of each speaker's speech in a two-way conversation call, along with the speaker tag, transcription, timestamps for each speaker's turn, and the confidence of the transcription.

For example: I have an MP3 file of a customer care call with 2 speakers. I would like to know each speaker's duration along with the speaker tag, transcription, and the confidence of the transcription.

I am facing issues with the end time and the confidence of the transcription: the confidence comes back as 0, and the end time does not match the actual end time.

audio link: https://drive.google.com/file/d/1OhwQ-xI7Rd-iKNj_dKP2unNxQzMIYlNW/view?usp=sharing

#!pip install --upgrade google-cloud-speech
from google.cloud import speech_v1p1beta1 as speech
import datetime

client = speech.SpeechClient.from_service_account_file('#cloud_credentials')

gs_uri = 'gs://your-bucket/your-audio-file'  # placeholder: GCS URI of the uploaded audio

audio = speech.RecognitionAudio(uri=gs_uri)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
    enable_speaker_diarization=True,
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
    diarization_speaker_count=2,
    use_enhanced=True,
    model='phone_call',
    profanity_filter=False,
    enable_word_confidence=True)

print('Waiting for operation to complete…')
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=100000)

with open('output_file.txt', "w") as text_file:
    for result in response.results:
        alternative = result.alternatives[0]
        confidence = alternative.confidence
        current_speaker_tag = -1
        transcript = ""
        time = 0
        for word in alternative.words:
            if word.speaker_tag != current_speaker_tag:
                if transcript != "":
                    print(u"Speaker {} - {} - {} - {}".format(current_speaker_tag, str(datetime.timedelta(seconds=time)), transcript, confidence), file=text_file)
                transcript = ""
                current_speaker_tag = word.speaker_tag
                time = word.start_time.seconds
            transcript = transcript + " " + word.word
        if transcript != "":
            print(u"Speaker {} - {} - {} - {}".format(current_speaker_tag, str(datetime.timedelta(seconds=time)), transcript, confidence), file=text_file)

print(u"Speech to text operation is completed, output file is created: {}".format('output_file.txt'))

(screenshot of the generated output file)


Solution

  • Your code and the screenshot in the question differ from each other. However, from the screenshot it is clear that you are separating each individual speaker's speech using the Speech-to-Text speaker diarization method.

    Here you can't calculate a separate confidence for each individual speaker, because the response contains a confidence value for each transcript and for each individual word, and a single transcript may or may not contain multiple speakers' words depending on the audio.
    Also, as per the documentation, the response contains all the words with speaker_tag in the last result list. From the docs:

    The transcript within each result is separate and sequential per result. However, the words list within an alternative includes all the words from all the results thus far. Thus, to get all the words with speaker tags, you only have to take the words list from the last result.

    For that last result, the confidence is 0. You can print the response to the console or write it to a file and inspect it yourself:

    # Detects speech in the audio file
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=10000)
     
    # check the whole response
    with open('output_file.txt', "w") as text_file:
       print(response, file=text_file)
    
    
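    As the quoted documentation says, the complete speaker-tagged word list lives in the last result. A minimal sketch to pull it out (assuming the `response` object from the call above):

    # all speaker-tagged words are in the words list of the last result
    words_info = response.results[-1].alternatives[0].words
    for word_info in words_info:
        print(u"word: '{}', speaker_tag: {}".format(word_info.word, word_info.speaker_tag))
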

    Or you can print each individual transcript and its confidence for better understanding, e.g.:

    #confidence for each transcript
    for result in response.results:
       alternative = result.alternatives[0]
       print("Transcript: {}".format(alternative.transcript))
       print("Confidence: {}".format(alternative.confidence))
    
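    Because `enable_word_confidence=True` is set in the config, each word in an alternative also carries its own confidence. A short sketch along the same lines:

    # confidence for each individual word
    for result in response.results:
        for word_info in result.alternatives[0].words:
            print(u"word: '{}', confidence: {}".format(word_info.word, word_info.confidence))
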

    For your duration issue: you are calculating the start time and end time for each word, not for each individual speaker. The idea should be something like this:

    1. Take the start time of the speaker's first word as the duration's start time.
    2. Always keep the latest word's end time as the duration's end time, because we don't know whether the next word belongs to a different speaker.
    3. Watch for a speaker change: if the speaker is the same, just append the word to the current speaker's text; otherwise flush the segment and also reset the start time for the new speaker. E.g.:
    tag = 1
    speaker = ""
    transcript = ''
    start_time = ""
    end_time = ""

    # all the speaker-tagged words are in the last result's words list
    words_info = response.results[-1].alternatives[0].words

    for word_info in words_info:
        if start_time == '':
            start_time = word_info.start_time.seconds  # set only once, for the very first word
        if word_info.speaker_tag == tag:
            speaker = speaker + " " + word_info.word
        else:
            # speaker changed: flush the previous speaker's segment (end_time still
            # holds the previous speaker's last word's end time), then reset
            transcript += "speaker {}: {}-{} - {}".format(tag, str(datetime.timedelta(seconds=start_time)), str(datetime.timedelta(seconds=end_time)), speaker) + '\n'
            tag = word_info.speaker_tag
            speaker = word_info.word
            start_time = word_info.start_time.seconds  # reset the start time for the new speaker
        end_time = word_info.end_time.seconds  # track the end time of the speech so far

    transcript += "speaker {}: {}-{} - {}".format(tag, str(datetime.timedelta(seconds=start_time)), str(datetime.timedelta(seconds=end_time)), speaker) + '\n'
    
    
    

    I have removed the confidence part from the modified transcript because it will always be 0. Also keep in mind that speaker diarization is still in beta, so you might not get exactly the desired output.
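
    If you also want the total speaking time per speaker, as the question asks, one possible sketch is to sum the per-word durations from the same words_info list. This aggregation is my own addition, not part of the API; it works in whole seconds and ignores pauses between words:

    import datetime
    from collections import defaultdict

    # total speaking time per speaker tag, summed word by word
    durations = defaultdict(int)
    for word_info in words_info:
        durations[word_info.speaker_tag] += word_info.end_time.seconds - word_info.start_time.seconds

    for speaker_tag, seconds in sorted(durations.items()):
        print("speaker {}: total duration {}".format(speaker_tag, datetime.timedelta(seconds=seconds)))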