In Gensim's doc2vec implementation, gensim.models.keyedvectors.Doc2VecKeyedVectors.most_similar returns the tags and cosine similarities of the documents most similar to the query document. What if I want the actual documents themselves and not the tags? Is there a way to do that directly, without searching for the document associated with each tag returned by most_similar?
Also, is there documentation on this? I can't seem to find the documentation for half of Gensim's classes.
The Doc2Vec class doesn't serve as a full document database that stores the original documents in their original formats. That would require a lot of extra complexity and state.
Instead, you just present the docs, with their particular tags, in the tokenized format it needs for training, and the model only learns and retains their vector representations.
If you then need to look up the original documents, you must maintain your own (tag -> document) lookup, which many projects will already have as the original source of the docs.
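As a minimal sketch of such a lookup, assuming the tags are unique strings and the original texts fit in a plain dict: the names docs_by_tag and resolve_documents are hypothetical, and the similar list with its similarity values is a made-up stand-in for the (tag, similarity) pairs that most_similar returns.

```python
# Hypothetical corpus: tag -> original text, kept outside the model.
docs_by_tag = {
    "doc_0": "The quick brown fox jumps over the lazy dog.",
    "doc_1": "A fast auburn fox leaps above a sleepy hound.",
    "doc_2": "Stock markets fell sharply on Monday.",
}


def resolve_documents(similar, lookup):
    """Map (tag, similarity) pairs to (tag, similarity, original document)."""
    return [(tag, sim, lookup[tag]) for tag, sim in similar]


# Illustrative stand-in for a most_similar result: (tag, cosine similarity) pairs.
# With a trained model, this list would come from a call such as
# model.dv.most_similar("doc_0") in Gensim 4.x (model.docvecs in older releases).
similar = [("doc_1", 0.91), ("doc_2", 0.12)]

for tag, sim, text in resolve_documents(similar, docs_by_tag):
    print(f"{tag}\t{sim:.2f}\t{text}")
```

For a corpus too large to hold in memory, the same idea should work with the dict swapped for a database or on-disk index keyed by tag.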
The Doc2Vec class docs are at https://radimrehurek.com/gensim/models/doc2vec.html, but it may also be helpful to look at the example Jupyter notebooks included in the gensim docs/notebooks directory, which are also viewable online at:
https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
The three notebooks related to Doc2Vec have filenames beginning with doc2vec-.