
Google Cloud transcription API


I would like to calculate the duration of each speaker's speech in a two-way conversation call, along with the speaker tag, transcription, timestamps for each speaker's turn, and the confidence of the transcription.

For example: I have an MP3 file of a customer care call with 2 speakers. I would like to know each speaker's duration along with the speaker tag, transcription, and the confidence of the transcription.

I am facing issues with the end time and the confidence of the transcription: the confidence comes back as 0, and the end time does not match the actual end time.

audio link: https://drive.google.com/file/d/1OhwQ-xI7Rd-iKNj_dKP2unNxQzMIYlNW/view?usp=sharing

#!pip install --upgrade google-cloud-speech
from google.cloud import speech_v1p1beta1 as speech
import datetime

client = speech.SpeechClient.from_service_account_file('#cloud_credentials')

gs_uri = 'gs://your-bucket/your-audio-file'  # placeholder: GCS URI of the uploaded audio

audio = speech.RecognitionAudio(uri=gs_uri)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
    enable_speaker_diarization=True,
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
    diarization_speaker_count=2,
    use_enhanced=True,
    model='phone_call',
    profanity_filter=False,
    enable_word_confidence=True)

print('Waiting for operation to complete…')
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=100000)

with open('output_file.txt', "w") as text_file:
    for result in response.results:
        alternative = result.alternatives[0]
        confidence = alternative.confidence
        current_speaker_tag = -1
        transcript = ""
        time = 0
        for word in alternative.words:
            if word.speaker_tag != current_speaker_tag:
                if transcript != "":
                    print(u"Speaker {} - {} - {} - {}".format(current_speaker_tag, str(datetime.timedelta(seconds=time)), transcript, confidence), file=text_file)
                transcript = ""
                current_speaker_tag = word.speaker_tag
                time = word.start_time.seconds
            transcript = transcript + " " + word.word
        if transcript != "":
            print(u"Speaker {} - {} - {} - {}".format(current_speaker_tag, str(datetime.timedelta(seconds=time)), transcript, confidence), file=text_file)

print(u"Speech to text operation is completed, output file is created: {}".format('output_file.txt'))

(screenshot of the generated output file)


Solution

  • Your code and the screenshot in the question differ from each other. However, from the screenshot it is clear that you are separating each individual speaker's speech using the Speech-to-Text speaker diarization method.

    Here you can't calculate a separate confidence for each individual speaker, because the response contains a confidence value for each transcript and for each individual word, and a single transcript may or may not contain multiple speakers' words depending on the audio.
    Also, as per the documentation, the response contains all the words with speaker_tag in the last result list. From the docs:

    The transcript within each result is separate and sequential per result. However, the words list within an alternative includes all the words from all the results thus far. Thus, to get all the words with speaker tags, you only have to take the words list from the last result.

    For that last result, the confidence is 0. You can print the response to the console or write it to a file and inspect it yourself:

    # Detects speech in the audio file
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=10000)
     
    # check the whole response
    with open('output_file.txt', "w") as text_file:
       print(response, file=text_file)
    
    
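    As the quoted documentation says, the complete speaker-tagged word list lives in the last result. A minimal sketch to pull it out (assuming the `response` object from the call above):

    # all speaker-tagged words are in the words list of the last result
    words_info = response.results[-1].alternatives[0].words
    for word_info in words_info:
        print(u"word: '{}', speaker_tag: {}".format(word_info.word, word_info.speaker_tag))
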

    Or you can print each individual transcript and its confidence for better understanding, e.g.:

    #confidence for each transcript
    for result in response.results:
       alternative = result.alternatives[0]
       print("Transcript: {}".format(alternative.transcript))
       print("Confidence: {}".format(alternative.confidence))
    
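    Because `enable_word_confidence=True` is set in the config, each word in an alternative also carries its own confidence. A short sketch along the same lines:

    # confidence for each individual word
    for result in response.results:
        for word_info in result.alternatives[0].words:
            print(u"word: '{}', confidence: {}".format(word_info.word, word_info.confidence))
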

    For your duration issue: you are calculating the start time and end time for each word, not for each individual speaker. The idea should be something like this:

    1. Take the start time of the speaker's first word as the duration's start time.
    2. Always keep the latest word's end time as the duration's end time, because we don't know whether the next word belongs to a different speaker.
    3. Watch for a speaker change: if the speaker is the same, just append the word to the current speaker's text; otherwise flush the segment and also reset the start time for the new speaker. E.g.:
    tag = 1
    speaker = ""
    transcript = ''
    start_time = ""
    end_time = ""

    # all the speaker-tagged words are in the last result's words list
    words_info = response.results[-1].alternatives[0].words

    for word_info in words_info:
        if start_time == '':
            start_time = word_info.start_time.seconds  # set only once, for the very first word
        if word_info.speaker_tag == tag:
            speaker = speaker + " " + word_info.word
        else:
            # speaker changed: flush the previous speaker's segment (end_time still
            # holds the previous speaker's last word's end time), then reset
            transcript += "speaker {}: {}-{} - {}".format(tag, str(datetime.timedelta(seconds=start_time)), str(datetime.timedelta(seconds=end_time)), speaker) + '\n'
            tag = word_info.speaker_tag
            speaker = word_info.word
            start_time = word_info.start_time.seconds  # reset the start time for the new speaker
        end_time = word_info.end_time.seconds  # track the end time of the speech so far

    transcript += "speaker {}: {}-{} - {}".format(tag, str(datetime.timedelta(seconds=start_time)), str(datetime.timedelta(seconds=end_time)), speaker) + '\n'
    
    
    

    I have removed the confidence part from the modified transcript because it will always be 0. Also keep in mind that speaker diarization is still in beta, so you might not get exactly the desired output.
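
    If you also want the total speaking time per speaker, as the question asks, one possible sketch is to sum the per-word durations from the same words_info list. This aggregation is my own addition, not part of the API; it works in whole seconds and ignores pauses between words:

    import datetime
    from collections import defaultdict

    # total speaking time per speaker tag, summed word by word
    durations = defaultdict(int)
    for word_info in words_info:
        durations[word_info.speaker_tag] += word_info.end_time.seconds - word_info.start_time.seconds

    for speaker_tag, seconds in sorted(durations.items()):
        print("speaker {}: total duration {}".format(speaker_tag, datetime.timedelta(seconds=seconds)))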