How to split a WhatsApp conversation into multiple blocks based on the context?


Let's imagine I download a CSV file that includes all the conversations I had with a friend over the past 6 months (WhatsApp chat). I would like to divide that CSV file into multiple "blocks" (each block being a different conversation). E.g.:

Day 1:

  • U1: Hey, how's going?
  • U2: Fine! Any plan for tomorrow?
  • U1: Nope

Day 2:

  • U2: Hello!

Day 3:

  • U1: Morning!
  • U2: ....

So, following the example above, the idea is to identify that my WhatsApp chat contains 3 blocks of different conversations: two initiated by U1 and one initiated by U2.

I cannot split it by time because some users could take a long time to reply to the previous message. So it seems I should be able to identify whether a new sentence that appears in the chat is related to the previous "block" of conversation or is actually starting a new block.

Any ideas of what steps I need to follow if I want to identify different conversations in one chat, or if a sentence is continuing the previous conversation/starting a new one?

Thanks!!


Solution

  • I think even though you don't like time as the proxy for one conversation block, it might perform just as well as more complicated NLP.

    If you want to try something more complicated, you would need some measure of semantic relatedness between texts. A common method is to embed your sentences/messages, e.g. with sentence-BERT (see sbert.net), and use the cosine similarity between sentences. You could say that a block ends once the embedding of the latest sentence is too dissimilar from that of the preceding sentence. Or you could even use BERT for next-sentence prediction to test which sentences are plausible to follow others. But it's unclear whether this performs better than a simple time proxy. Sometimes simpler is better :)
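
A minimal sketch of both ideas, to make the comparison concrete. Everything specific here is an assumption rather than part of the answer: the `(timestamp, sender, text)` tuple layout, the 6-hour `max_gap`, the 0.3 `threshold`, and all function names are hypothetical. `word_overlap` is only a toy stand-in for a real embedding similarity; to use sentence-BERT you would pass in your own function that embeds both messages and returns their cosine similarity.

```python
import math
import re
from collections import Counter
from datetime import datetime, timedelta


def split_by_time_gap(messages, max_gap=timedelta(hours=6)):
    """Start a new block when the pause since the previous message exceeds max_gap.

    `messages` is a list of (timestamp, sender, text) tuples sorted by timestamp
    (an assumed layout; adapt it to the columns of your CSV export).
    """
    blocks = []
    for msg in messages:
        if blocks and msg[0] - blocks[-1][-1][0] <= max_gap:
            blocks[-1].append(msg)
        else:
            blocks.append([msg])
    return blocks


def word_overlap(a, b):
    """Toy similarity: cosine over word counts (stand-in for embedding cosine)."""
    ca = Counter(re.findall(r"\w+", a.lower()))
    cb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def split_by_similarity(texts, similarity=word_overlap, threshold=0.3):
    """Start a new block when a message is too dissimilar from the one before it."""
    blocks = []
    for text in texts:
        if blocks and similarity(blocks[-1][-1], text) >= threshold:
            blocks[-1].append(text)
        else:
            blocks.append([text])
    return blocks


# Made-up timestamps for the example chat from the question.
messages = [
    (datetime(2023, 1, 1, 9, 0), "U1", "Hey, how's going?"),
    (datetime(2023, 1, 1, 9, 5), "U2", "Fine! Any plan for tomorrow?"),
    (datetime(2023, 1, 1, 9, 7), "U1", "Nope"),
    (datetime(2023, 1, 2, 10, 0), "U2", "Hello!"),
    (datetime(2023, 1, 3, 8, 0), "U1", "Morning!"),
    (datetime(2023, 1, 3, 8, 30), "U2", "...."),
]
time_blocks = split_by_time_gap(messages)
print([len(block) for block in time_blocks])  # [3, 1, 2]
print([block[0][1] for block in time_blocks])  # ['U1', 'U2', 'U1']
```

Short chat messages such as "Nope" or "Hello!" share almost no words with their neighbours, so the toy `word_overlap` measure would split far too often on real chats. That is one reason to try a proper sentence embedding for `similarity`, and also a reason the time-gap version may be hard to beat.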