python nlp gensim doc2vec

Document similarity with doc2vec


This Gensim example notebook on GitHub, https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb, ends with examples that find similarities to phrases or keywords such as 'lady gaga' or 'machine learning'. However, I am looking to find documents similar to an actual document stored as a plain text file. Can this be done, and if so, how? Assume the file is a .txt file on my local laptop.


Solution

  • Tokenize the query document the same way as the training data. Pass those tokens to the Doc2Vec model's infer_vector() method to get a vector for the query document. Pass that vector to most_similar() to get a ranked list of known documents similar to that vector.

    There are examples of using infer_vector() this way from cell 10 onward in another demo notebook included with gensim:

    https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb