python gensim word2vec doc2vec

How to get document vectors for a given topic in gensim


I have about 9000 documents and I am using Gensim's doc2vec to embed my documents. My code is as follows:

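import json
from collections import namedtuple

from gensim.models import doc2vec

# input_file is defined elsewhere; each entry of the JSON is indexed as [tag, words]
with open(input_file) as f:
    dataset = json.load(f)

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')

for description in dataset:
    tags = [description[0]]   # description[0] is used as the document's tag
    words = description[1]    # description[1] is used as its words (Doc2Vec expects a list of tokens)
    docs.append(analyzedDocument(words, tags))

model = doc2vec.Doc2Vec(docs, vector_size=100, window=10, min_count=1, workers=4, epochs=20)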

I would like to get all the documents related to the topic "deep learning", i.e., the documents whose content is mainly about deep learning. Is it possible to do this with a doc2vec model in gensim?

I am happy to provide more details if needed.


Solution

  • If there were a document in your training set that is a great example of "deep learning" (say, docs[17]), then after successful training you could ask for the documents most similar to that example, and the result could be roughly what you need. For example:

    sims = model.docvecs.most_similar(docs[17].tags[0])
    

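    sims would then hold a ranked list of (tag, similarity score) pairs for the 10 documents most similar to the target document; 10 is the default value of the topn parameter. In gensim 4.0 and later, the docvecs attribute is named dv, so the call becomes model.dv.most_similar(...).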
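
To make the ranking step concrete, here is a minimal, library-free sketch of a "most similar documents" lookup based on cosine similarity. It is not gensim's implementation. The helper names (cosine, most_similar), the tags, and the 3-dimensional vectors are all made up for illustration; in practice the vectors would come from the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(doc_vectors, query_tag, topn=10):
    """Rank every other document by cosine similarity to the query document."""
    query = doc_vectors[query_tag]
    scored = [
        (tag, cosine(query, vec))
        for tag, vec in doc_vectors.items()
        if tag != query_tag  # skip the query document itself
    ]
    # Highest similarity first, then keep only the top `topn` results.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors standing in for trained document vectors.
toy_vectors = {
    "doc_dl_example": [0.9, 0.1, 0.0],
    "doc_neural_nets": [0.8, 0.2, 0.1],
    "doc_cooking": [0.0, 0.1, 0.9],
    "doc_gardening": [0.1, 0.0, 0.8],
}

sims = most_similar(toy_vectors, "doc_dl_example", topn=2)
for tag, score in sims:
    print(tag, round(score, 3))
```

The output has the same shape as the sims list described above: (tag, score) pairs, best match first. How useful the result is depends on how well the chosen example document represents the topic.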