I have about 9,000 documents and am using Gensim's doc2vec to embed them. My code is as follows:
import json
from collections import namedtuple

from gensim.models import doc2vec

# Load the dataset; input_file is the path to the JSON file
with open(input_file) as f:
    dataset = json.load(f)

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for description in dataset:
    # description[0] is used as the document tag, description[1] as its words
    tags = [description[0]]
    words = description[1]
    docs.append(analyzedDocument(words, tags))

model = doc2vec.Doc2Vec(docs, vector_size=100, window=10, min_count=1, workers=4, epochs=20)
I would like to get all the documents related to the topic "deep learning", i.e. the documents whose content is mainly about deep learning. Is it possible to do this with a doc2vec model in Gensim?
I am happy to provide more details if needed.
If there was a document in your training set that was a great example of "deep learning" (say, docs[17]), then after successful training you could ask for documents similar to that example document, and the results could be roughly what you need. For example:
sims = model.docvecs.most_similar(docs[17].tags[0])  # in gensim 4.x, docvecs was renamed to dv: model.dv.most_similar(...)
You'd then have in sims a ranked, scored list of the 10 most-similar documents to the tag for the target document.
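To make the shape of that result concrete, here is a minimal pure-Python sketch of the idea behind such a lookup: rank every other document by cosine similarity to the target's vector and return (tag, score) pairs. This is not Gensim's implementation. The function names, the toy tags, and the three-dimensional vectors are all made up for illustration; real doc2vec vectors would have vector_size dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(doc_vectors, target_tag, topn=10):
    """Return up to topn (tag, score) pairs, best match first.

    doc_vectors is a hypothetical dict mapping each tag to its vector.
    The target document itself is excluded from the results.
    """
    target = doc_vectors[target_tag]
    scored = [
        (tag, cosine(target, vec))
        for tag, vec in doc_vectors.items()
        if tag != target_tag
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors (assumed values, not trained embeddings)
toy_vectors = {
    "doc_dl_1": [0.9, 0.1, 0.0],
    "doc_dl_2": [0.8, 0.2, 0.1],
    "doc_cooking": [0.0, 0.1, 0.9],
}

sims = most_similar(toy_vectors, "doc_dl_1", topn=2)
for tag, score in sims:
    print(tag, round(score, 3))
```

Because the result is only a ranking by similarity, deciding which documents count as "related to deep learning" still means choosing a cutoff yourself, either a fixed number of results or a minimum score that you validate by inspecting the matches.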