I would like to calculate the duration of speech for each speaker in a two-way conversation call, along with the speaker tag, transcription, timestamps and the confidence of the transcription.
For example: I have an mp3 file of a customer-care support call with 2 speakers. I would like to know each speaker's speaking duration, along with the speaker tag, the transcription and the confidence of the transcription.
I am facing issues with the end time and the confidence of the transcription: I am getting a confidence of 0 for the transcription, and the end time does not match the actual end time.
audio link: https://drive.google.com/file/d/1OhwQ-xI7Rd-iKNj_dKP2unNxQzMIYlNW/view?usp=sharing
```python
#!pip install --upgrade google-cloud-speech
from google.cloud import speech_v1p1beta1 as speech
import datetime

tag = 1
speaker = ""
transcript = ''

client = speech.SpeechClient.from_service_account_file('#cloud_credentials')
audio = speech.types.RecognitionAudio(uri=gs_uri)
config = speech.types.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
    enable_speaker_diarization=True,
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
    diarization_speaker_count=2,
    use_enhanced=True,
    model='phone_call',
    profanity_filter=False,
    enable_word_confidence=True)

print('Waiting for operation to complete…')
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=100000)

with open('output_file.txt', "w") as text_file:
    for result in response.results:
        alternative = result.alternatives[0]
        confidence = result.alternatives[0].confidence
        current_speaker_tag = -1
        transcript = ""
        time = 0
        for word in alternative.words:
            if word.speaker_tag != current_speaker_tag:
                if transcript != "":
                    print(u"Speaker {} - {} - {} - {}".format(current_speaker_tag, str(datetime.timedelta(seconds=time)), transcript, confidence), file=text_file)
                transcript = ""
                current_speaker_tag = word.speaker_tag
                time = word.start_time.seconds
            transcript = transcript + " " + word.word
        if transcript != "":
            print(u"Speaker {} - {} - {} - {}".format(current_speaker_tag, str(datetime.timedelta(seconds=time)), transcript, confidence), file=text_file)

print(u"Speech to text operation is completed, output file is created: {}".format('output_file.txt'))
```
Your code and the screenshot in the question differ from each other. However, from the screenshot it is clear that you are reconstructing each individual speaker's speech using the Speech-to-Text speaker diarization method.
Here you can’t calculate a separate confidence for each individual speaker, because the response contains a confidence value only per transcript and per individual word. A single transcript may or may not contain multiple speakers' words, depending on the audio.
Also, as per the documentation, the response contains all the words with `speaker_tag` in the last result list. From the doc:

> The transcript within each result is separate and sequential per result. However, the words list within an alternative includes all the words from all the results thus far. Thus, to get all the words with speaker tags, you only have to take the words list from the last result.
For that last result, the confidence is 0. You can write the response to the console or to a file and debug it yourself.
```python
# Detects speech in the audio file
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=10000)

# check the whole response
with open('output_file.txt', "w") as text_file:
    print(response, file=text_file)
```
Or you can also print each individual transcript and its confidence for better understanding, e.g.:
```python
# confidence for each transcript
for result in response.results:
    alternative = result.alternatives[0]
    print("Transcript: {}".format(alternative.transcript))
    print("Confidence: {}".format(alternative.confidence))
```
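Word-level confidence values (returned because `enable_word_confidence=True` is set in the config) are also worth inspecting: unlike the last result's transcript confidence, they are populated. A small sketch of reading them; the `SimpleNamespace` stand-ins below are hypothetical substitutes for real `WordInfo` objects, which expose the same `.word` and `.confidence` fields, so the loop can be tried without calling the API:

```python
from types import SimpleNamespace

def word_confidences(words_info):
    # Collect (word, confidence) pairs from a diarized words list.
    # Real usage: words_info = response.results[-1].alternatives[0].words
    return [(w.word, w.confidence) for w in words_info]

# Hypothetical stand-ins mirroring WordInfo's .word / .confidence fields
demo_words = [SimpleNamespace(word="hello", confidence=0.92),
              SimpleNamespace(word="support", confidence=0.88)]

for word, conf in word_confidences(demo_words):
    print("Word: {}, Confidence: {}".format(word, conf))
```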
For your duration issue with each speaker: you are calculating the start time and end time for each word, not for each individual speaker. The idea should be something like this:
```python
tag = 1
speaker = ""
transcript = ''
start_time = ""
end_time = ""

# all the words with speaker tags are in the last result list
words_info = response.results[-1].alternatives[0].words

for word_info in words_info:
    if start_time == '':
        start_time = word_info.start_time.seconds  # setting the value only for the first word
    if word_info.speaker_tag == tag:
        speaker = speaker + " " + word_info.word
    else:
        # flush the previous speaker's segment; end_time still holds the
        # previous word's end, so the segment ends where that speaker stopped
        transcript += "speaker {}: {}-{} - {}".format(tag, str(datetime.timedelta(seconds=start_time)), str(datetime.timedelta(seconds=end_time)), speaker) + '\n'
        tag = word_info.speaker_tag
        speaker = "" + word_info.word
        start_time = word_info.start_time.seconds  # resetting the start time as we found a new speaker
    end_time = word_info.end_time.seconds  # tracking the end time of speech
transcript += "speaker {}: {}-{} - {}".format(tag, str(datetime.timedelta(seconds=start_time)), str(datetime.timedelta(seconds=end_time)), speaker) + '\n'
```
I have removed the confidence part in the modified transcript because it will always be 0. Also keep in mind that speaker diarization is still in beta, so you might not get exactly the output you want.
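To tie the pieces together, here is a minimal, self-contained sketch of the same per-speaker grouping. The `speaker_segments` helper and the `SimpleNamespace` stand-ins are mine, not part of the API; the stand-ins mimic `WordInfo`'s `.word`, `.speaker_tag`, `.start_time`/`.end_time` and `.confidence` fields so the logic can be tested offline. As a workaround for the transcript confidence being 0, it approximates a per-segment confidence by averaging the word-level values:

```python
import datetime
from types import SimpleNamespace

def speaker_segments(words_info):
    """Group consecutive same-speaker words into
    (tag, start, end, text, avg_word_confidence) tuples."""
    segments = []
    current = None
    for w in words_info:
        if current is None or w.speaker_tag != current["tag"]:
            if current is not None:
                segments.append(_finish(current))
            current = {"tag": w.speaker_tag,
                       "start": w.start_time.seconds,
                       "words": [], "confidences": []}
        current["end"] = w.end_time.seconds  # last word's end closes the segment
        current["words"].append(w.word)
        current["confidences"].append(w.confidence)
    if current is not None:
        segments.append(_finish(current))
    return segments

def _finish(seg):
    avg_conf = sum(seg["confidences"]) / len(seg["confidences"])
    return (seg["tag"],
            str(datetime.timedelta(seconds=seg["start"])),
            str(datetime.timedelta(seconds=seg["end"])),
            " ".join(seg["words"]),
            round(avg_conf, 2))

# Hypothetical stand-ins; in real use pass
# response.results[-1].alternatives[0].words instead.
def _w(word, tag, start, end, conf):
    return SimpleNamespace(word=word, speaker_tag=tag,
                           start_time=SimpleNamespace(seconds=start),
                           end_time=SimpleNamespace(seconds=end),
                           confidence=conf)

demo = [_w("hi", 1, 0, 1, 0.9), _w("there", 1, 1, 2, 0.8),
        _w("hello", 2, 3, 4, 0.7)]
for seg in speaker_segments(demo):
    print(seg)
```

Each tuple gives one speaker turn with its duration bounds, so the total time per speaker can be summed from the start/end pairs.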