python-3.x nlp text-mining gensim doc2vec

Gensim doc2vec most_similar equivalent to get full documents

In Gensim's doc2vec implementation, gensim.models.keyedvectors.Doc2VecKeyedVectors.most_similar returns the tags and cosine similarity of the documents most similar to the query document. What if I want the actual documents themselves and not the tags? Is there a way to do that directly without searching for the document associated with the tag returned by most_similar?

Also, is there documentation on this? I can't seem to find the documentation for half of Gensim's classes.

Solution

The Doc2Vec class doesn't serve as a full document database that stores the original documents in their original formats. That would require a lot of extra complexity and state.

Instead, you present the documents, each with its tags, in the tokenized format the model needs for training, and the model learns and retains only their vector representations.

If you then need to look up the original documents, you must maintain your own (tag -> document) lookup, which many projects will already have as the original source of the docs.
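A minimal sketch of such a lookup, assuming the tags are unique strings and the original texts fit in an ordinary dict. The names docs_by_tag and resolve_documents are hypothetical, and the similar list is a hand-written stand-in for what most_similar returns (a list of (tag, similarity) pairs), so the snippet runs without a trained model:

```python
from typing import Dict, List, Tuple

# Hypothetical corpus: the original texts, keyed by the same tags that were
# attached to each document when it was passed to Doc2Vec for training.
docs_by_tag: Dict[str, str] = {
    "doc_0": "The quick brown fox jumps over the lazy dog.",
    "doc_1": "A fast auburn fox leaps above a sleepy hound.",
    "doc_2": "Stock markets fell sharply on Monday.",
}


def resolve_documents(
    similar: List[Tuple[str, float]],
    lookup: Dict[str, str],
) -> List[Tuple[str, float, str]]:
    """Turn (tag, similarity) pairs into (tag, similarity, document) triples."""
    return [(tag, sim, lookup[tag]) for tag, sim in similar]


# Stand-in for a real query result. With a trained model this would come from
# the document vectors' most_similar call (model.docvecs in gensim 3.x,
# model.dv in gensim 4.x).
similar = [("doc_1", 0.91), ("doc_2", 0.12)]

results = resolve_documents(similar, docs_by_tag)
for tag, sim, text in results:
    print(f"{tag}\t{sim:.2f}\t{text}")
```

For a corpus too large to hold in memory, the same idea applies with the dict swapped for a database or a file offset index keyed by tag.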

The Doc2Vec class documentation is at https://radimrehurek.com/gensim/models/doc2vec.html. It may also help to look at the example Jupyter notebooks included in the gensim docs/notebooks directory, which are also viewable online at:

https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks

The three notebooks related to Doc2Vec have filenames beginning with doc2vec-.