Tags: python, python-3.x, nlp, gensim, doc2vec

Can gensim Doc2Vec be used to compare a novel document to a trained model?


I have a set of documents that all fit a pre-defined category, and I have successfully trained a model on those documents.

The question is, if I have a novel document, how can I calculate how closely this new document lines up with my trained model?

My current solution:

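# Infer a vector for the new document (gensim 4.x renames steps to epochs)
novel_vector = model.infer_vector(novel_doc_words, steps=20)
# Find the nearest training documents (gensim 4.x: model.dv instead of model.docvecs)
similarity_scores = model.docvecs.most_similar([novel_vector])
# Average the cosine similarities of those top-N neighbors
overall_similarity = sum(score for _, score in similarity_scores) / len(similarity_scores)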

I was unable to find any convenience methods for this in the documentation.


Solution

  • There's no built-in method to check this sort of "lines up with" value, with respect to the whole model.

    A more typical approach, matching existing capabilities, would be to train a model on a diversity of documents – not just those in a specific category. Then, after inferring a new document's vector, calculate its average distance to documents of just the category of interest.

    If you instead train a model only on documents from one self-similar category, the learned coordinate space won't reflect the full range of possible documents outside that category as well.

    That said, if your current code, which checks how similar a new document is to its top-N nearest neighbors, seems to give good results for your purposes, maybe it's acceptable. I'd just expect better results from a model that had trained on a wider variety of documents.
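
The averaging step described above could be sketched roughly as follows. This is a minimal illustration, not a gensim feature: the helper names `cosine_similarity` and `mean_category_similarity` are hypothetical, and the sketch assumes you already have the inferred vector (for example from `model.infer_vector(...)`) and the stored vectors of the documents in the category of interest as plain sequences of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def mean_category_similarity(novel_vector, category_vectors):
    """Average similarity of one vector to every vector in a category.

    Hypothetical helper: category_vectors is assumed to hold the trained
    vectors of only those documents tagged with the category of interest,
    taken from a model trained on a wider variety of documents.
    """
    if not category_vectors:
        raise ValueError("category_vectors must not be empty")
    sims = [cosine_similarity(novel_vector, v) for v in category_vectors]
    return sum(sims) / len(sims)


# Toy vectors standing in for an inferred vector and two category documents.
novel = [1.0, 0.0]
category = [[1.0, 0.0], [0.0, 1.0]]
score = mean_category_similarity(novel, category)
```

Unlike averaging over the top-N nearest neighbors, this compares the new document against every document in the chosen category, so the score is not inflated by selecting only the closest matches.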