python nlp text-mining

Extract individual speech acts from call transcript


I have call transcript data as follows:

'[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone. 
[0:00:10] spk1 : sure, let me know the issue'

I want the text data for spk1 separated from spk2.

I tried this:

import re

text = "[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone. [0:00:10] spk1 : sure, let me know the issue"

m = re.search('\](.+?)\[', text)
if m:
    found = m.group
found

But I am not getting the expected result.


Solution

  • Assuming you want to keep the order, time and speaker information, and allow for a fairly flexible structure (any number of speakers, and the same speaker allowed to speak at two or more consecutive timestamps):

    import re
    
    text = "[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone. [0:00:10] spk1 : sure, let me know the issue"
    
    conversation_dict_list = []
    # iterate over tokens split on whitespace
    for token in text.split():
        # timestamp: start a new entry with the time, no speaker yet and empty text
        # (raw strings avoid invalid-escape warnings for \[ and \d)
        if re.fullmatch(r"\[\d+:\d\d:\d\d\]", token):
            conversation_dict_list.append({"time": token[1:-1], "speaker": None, "text": ""})
        # speaker: fill the speaker field of the current entry
        elif re.fullmatch(r"spk\d+", token):
            conversation_dict_list[-1]["speaker"] = token
        # text: keep appending to the text field (with a separating space)
        else:
            conversation_dict_list[-1]["text"] += " " + token
    
    # remove the leading " : " (3 characters) from each text
    for d in conversation_dict_list:
        d["text"] = d["text"][3:]
    
    print(conversation_dict_list)
    

    Which prints:

    > [{'time': '0:00:00', 'speaker': 'spk1', 'text': 'Hi how are you'}, {'time': '0:00:02', 'speaker': 'spk2', 'text': 'I am good, need help on my phone.'}, {'time': '0:00:10', 'speaker': 'spk1', 'text': 'sure, let me know the issue'}]
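
Since the question asks for the text of spk1 separated from spk2, the list printed above can be grouped by speaker in one more pass. This is only a minimal sketch: `group_by_speaker` and `texts_by_speaker` are names made up for illustration, and the input list is copied from the output above so that the snippet runs on its own.

```python
from collections import defaultdict

# Copied from the output shown above so this snippet is self-contained
conversation_dict_list = [
    {"time": "0:00:00", "speaker": "spk1", "text": "Hi how are you"},
    {"time": "0:00:02", "speaker": "spk2", "text": "I am good, need help on my phone."},
    {"time": "0:00:10", "speaker": "spk1", "text": "sure, let me know the issue"},
]


def group_by_speaker(turns):
    """Hypothetical helper: map each speaker to their texts, in order of appearance."""
    texts_by_speaker = defaultdict(list)
    for turn in turns:
        texts_by_speaker[turn["speaker"]].append(turn["text"])
    return dict(texts_by_speaker)


texts_by_speaker = group_by_speaker(conversation_dict_list)
print(texts_by_speaker["spk1"])  # ['Hi how are you', 'sure, let me know the issue']
print(texts_by_speaker["spk2"])  # ['I am good, need help on my phone.']
```

Keeping the per-turn list and deriving the per-speaker view from it means the timestamps and the original order are still available if you need them later.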
    

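    The token-based loop only works if every turn follows the exact pattern [h:mm:ss] spkX : text. If, for example, several speakers appear under the same timestamp, the speaker field is overwritten by the last one, and input that does not start with a timestamp raises an IndexError on conversation_dict_list[-1].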
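
If the pattern really is that regular, another option is to match each whole turn with a single regex. Unlike `re.search`, which stops at the first match, `re.finditer` yields every turn. The sketch below makes the same assumption about the format ([h:mm:ss] spkN : text); `TURN_RE` and `parse_turns` are names invented for this example. Because the speaker must directly follow a timestamp, a word such as spk2 inside the spoken text is not mistaken for a speaker. The trade-off is that anything not matching the pattern is skipped silently rather than raising an error.

```python
import re

text = "[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone. [0:00:10] spk1 : sure, let me know the issue"

# One turn = "[h:mm:ss] spkN : text". The text is matched lazily and stops
# right before the next "[h:mm:ss]" timestamp or at the end of the string.
TURN_RE = re.compile(
    r"\[(?P<time>\d+:\d\d:\d\d)\]\s*(?P<speaker>spk\d+)\s*:\s*(?P<text>.*?)\s*"
    r"(?=\[\d+:\d\d:\d\d\]|$)",
    re.DOTALL,  # let the text span line breaks inside a turn
)


def parse_turns(transcript):
    """Hypothetical helper: return one dict (time, speaker, text) per matched turn."""
    return [m.groupdict() for m in TURN_RE.finditer(transcript)]


turns = parse_turns(text)
for turn in turns:
    print(turn["time"], turn["speaker"], turn["text"])
```

For the sample transcript this produces the same three entries as the loop above.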