I have call transcript data as follows:
'[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone.
[0:00:10] spk1 : sure, let me know the issue'
I want the text data for spk1 separated from spk2.
I tried this:
import re
text = "[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone. [0:00:10] spk1 : sure, let me know the issue"
m = re.search('\](.+?)\[', text)
if m:
    found = m.group
found
But I am not getting the expected result.
Assuming you want to keep the order, timestamps, and speaker information, and want some flexibility in the input (any number of speakers, and the same speaker may speak at two or more consecutive timestamps):
import re
text = "[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone. [0:00:10] spk1 : sure, let me know the issue"
conversation_dict_list = []
# iterate over tokens split by whitespace
for token in text.split():
    # timestamp: append a new dict with the time, no speaker yet, and empty text
    if re.fullmatch(r"\[\d+:\d\d:\d\d\]", token):
        conversation_dict_list.append({"time": token[1:-1], "speaker": None, "text": ""})
    # speaker: fill the speaker field of the most recent entry
    elif re.fullmatch(r"spk\d+", token):
        conversation_dict_list[-1]["speaker"] = token
    # text: keep concatenating to the text field (preceded by a space)
    else:
        conversation_dict_list[-1]["text"] += " " + token
# remove the leading " : " (three characters) from each text
conversation_dict_list = [{k_: (v_ if k_ != "text" else v_[3:]) for k_, v_ in d.items()} for d in conversation_dict_list]
print(conversation_dict_list)
Which returns:
> [{'time': '0:00:00', 'speaker': 'spk1', 'text': 'Hi how are you'}, {'time': '0:00:02', 'speaker': 'spk2', 'text': 'I am good, need help on my phone.'}, {'time': '0:00:10', 'speaker': 'spk1', 'text': 'sure, let me know the issue'}]
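Note that this only works if every turn follows the exact pattern "[h:mm:ss] spkX : text". If, for example, several speakers appear under the same timestamp, the speaker field is overwritten with the last one. The same happens if a word in the spoken text itself looks like a speaker label (e.g. "spk3").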
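To get the text of spk1 separated from spk2, as asked in the question, the turns can also be grouped by speaker. Below is a minimal sketch under the same assumption that every turn has the form "[h:mm:ss] spkX : text". The names TURN_PATTERN and split_by_speaker are made up for this example, and it uses a single regular expression instead of the token loop above:

```python
import re
from collections import defaultdict

text = "[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone. [0:00:10] spk1 : sure, let me know the issue"

# One turn: "[h:mm:ss] spkN : text", where the text runs lazily up to the
# next timestamp or the end of the string (assumed format, see above).
TURN_PATTERN = re.compile(
    r"\[(\d+:\d\d:\d\d)\]\s*(spk\d+)\s*:\s*(.*?)\s*(?=\[\d+:\d\d:\d\d\]|$)",
    re.DOTALL,
)

def split_by_speaker(transcript):
    """Return a dict mapping each speaker label to a list of their utterances, in order."""
    by_speaker = defaultdict(list)
    for _time, speaker, utterance in TURN_PATTERN.findall(transcript):
        by_speaker[speaker].append(utterance)
    return dict(by_speaker)

texts_by_speaker = split_by_speaker(text)
print(texts_by_speaker["spk1"])
print(texts_by_speaker["spk2"])
```

For the sample transcript this prints ['Hi how are you', 'sure, let me know the issue'] for spk1 and ['I am good, need help on my phone.'] for spk2. Because the text is captured up to the next timestamp rather than token by token, a word such as "spk3" inside an utterance is not mistaken for a speaker label. Input that does not follow the assumed pattern is simply skipped.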