Search code examples
pythontext-miningdoc2vecgensym

How to break conversation data into pairs of (Context , Response)


I'm using Gensim Doc2Vec model, trying to cluster portions of a customer support conversations. My goal is to give the support team an auto response suggestions.

Figure 1: shows a sample conversations where the user question is answered in the next conversation line, making it easy to extract the data:

Figure 1

during the conversation "hello" and "Our offices are located in NYC" should be suggested


Figure 2: describes a conversation where the questions and answers are not in sync

Figure 2

during the conversation "hello" and "Our offices are located in NYC" should be suggested


Figure 3: describes a conversation where the context for the answer is built over time, and for classification purpose (I'm assuming) some of the lines are redundant.

Figure 3

during the conversation "here is a link for the free trial account" should be suggested


I have the following data per conversation line (simplified):
who wrote the line (user or agent), text, time stamp

I'm using the following code to train my model:

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedLineDocument
import datetime

print('Creating documents',datetime.datetime.now().time())
context = TaggedLineDocument('./test_data/context.csv')

print('Building model',datetime.datetime.now().time())

model = Doc2Vec(context,size = 200, window = 10, min_count = 10, workers=4)
print('Training...',datetime.datetime.now().time())

for epoch in range(10):
    print('Run number :',epoch)
    model.train(context)

model.save('./test_data/model')

Q: How should I structure my training data and what heuristics could be applied in order to extract it from the raw data?


Solution

  • To train a model I would start by concatenating consecutive sequences of messages. What I would do is, using the timestamps, concatenate the messages without any message in between from the other entity.

    For instance:

    Hello
    I have a problem
    I cannot install software X
                                           Hi
                                           What error do you get?
    

    would be:

    Hello I have a problem I cannot install software X
                                           Hi What error do you get?
    

    Then I would train a model with sentences in that format. I would do that because I am assuming that the conversations have a "single topic" all the time between interactions from the entities. And in that scenario suggesting a single message Hi What error do you get? would be totally fine.

    Also, take a look at the data. If the questions from the users are usually single-sentenced (as in the examples) sentence detection could help a lot. In that case I would apply sentence detection on the concatenated strings (nltk could be an option) and use only single-sentenced questions for training. That way you can avoid the out-of-sync problem when training the model at the price of reducing the size of the dataset.

    On the other hand, I would really consider to start with a very simple method. For example you could score questions by tf-idf and, to get a suggestion, you can take the most similar question in your dataset wrt some metric (e.g. cosine similarity) and suggest the answer for that question. That will perform very bad in sentences with context information (e.g. how do you do it?) but can perform well in sentences like where are you based?.

    My last suggestion is because traditional methods perform even better than complex NN methods when the dataset is small. How big is your dataset?

    How you train a NN method is also crucial, there are a lot of hyper-parameters, and tuning them properly can be difficult, that's why having a baseline with a simple method can help you a lot to check how well you are doing. In this other paper they compare the different hyper-parameters for doc2vec, maybe you find it useful.

    Edit: a completely different option would be to train a model to "link" questions with answers. But for that you should manually tag each question with the corresponding answer and then train a supervised learning model on that data. That could potentially generalize better but with the added effort of manually labelling the sentences and still it doesn't look like an easy problem to me.