Tags: python, nlp, gensim, word2vec, doc2vec

Doc2Vec: Similarity Between Coded Documents and Unseen Documents


I have a sample of ~60,000 documents. We've hand-coded 700 of them as having a certain type of content. Now we'd like to find the "most similar" documents to the 700 we already hand-coded. We're using gensim's doc2vec and I can't quite figure out the best way to do this.

Here's what my code looks like:

import multiprocessing
import random

from gensim.models.doc2vec import Doc2Vec

cores = multiprocessing.cpu_count()

model = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample=0,
                epochs=10, workers=cores, dbow_words=1, train_lbls=False)

all_docs = load_all_files() # this function returns a list of named tuples, one per document
random.shuffle(all_docs)
print("Docs loaded!")
model.build_vocab(all_docs)
model.train(all_docs, total_examples=model.corpus_count, epochs=5)

I can't figure out the right way to go forward. Is this something that doc2vec can do? In the end, I'd like to have a ranked list of the 60,000 documents, where the first one is the "most similar" document.

Thanks for any help you might have! I've spent a lot of time reading the gensim help documents and the various tutorials floating around and haven't been able to figure it out.

EDIT: I can use this code to get the documents most similar to a short sentence:

token = "words associated with my research questions".split()
new_vector = model.infer_vector(token)
sims = model.docvecs.most_similar([new_vector])
for x in sims:
    print(' '.join(all_docs[x[0]][0]))

If there's a way to modify this to instead get the documents most similar to the 700 coded documents, I'd love to learn how to do it!


Solution

  • Your general approach is reasonable. A few notes about your setup (an adjusted setup sketch follows these notes):

    • you'd have to specify epochs=10 in your train() call to truly get 10 training passes – and 10 or more is most common in published work
    • sample-controlled downsampling helps speed training and often improves vector quality as well, and the value can become more aggressive (smaller) with larger datasets
    • train_lbls is not a parameter to Doc2Vec in any recent gensim version
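
    For example, a setup reflecting those notes might look roughly like this (the sample value is illustrative, not a tuned recommendation):

    model = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2,
                    sample=1e-4,  # enable frequent-word downsampling; smaller = more aggressive
                    epochs=10, workers=cores, dbow_words=1)  # no train_lbls; it isn't a Doc2Vec parameter
    model.build_vocab(all_docs)
    # re-use the model's epochs setting here so training really makes 10 passes
    model.train(all_docs, total_examples=model.corpus_count, epochs=model.epochs)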

    There are several possible ways to interpret and pursue your goal of "find the 'most similar' documents to the 700 we already hand-coded". For example, for a candidate document, how should its similarity to the set-of-700 be defined - as a similarity to one summary 'centroid' vector for the full set? Or as its similarity to any one of the documents?

    There are a couple of ways you could obtain a single summary vector for the set (a rough sketch of both follows this list):

    • average their 700 vectors together

    • combine all their words into one synthetic composite document, and infer_vector() on that document. (But note: texts fed to gensim's optimized word2vec/doc2vec routines face an internal implementation limit of 10,000 tokens – excess words are silently ignored.)
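
    A rough sketch of both options, assuming all_docs holds TaggedDocument-like items (with .words and .tags) and ref_docs holds the 700 tags used during training:

    import numpy as np

    # Option 1: average the 700 per-document vectors learned during training
    centroid = np.mean([model.docvecs[tag] for tag in ref_docs], axis=0)

    # Option 2: concatenate the words of the 700 documents and infer a single vector
    # (keep the ~10,000-token limit of the optimized inference code in mind)
    ref_tags = set(ref_docs)
    composite_words = [w for doc in all_docs if doc.tags[0] in ref_tags for w in doc.words]
    composite_vector = model.infer_vector(composite_words)

    Either vector could then be passed to most_similar(), e.g. model.docvecs.most_similar([centroid]).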

    In fact, the most_similar() method can take a list of multiple vectors as its 'positive' target, and will automatically average them together before returning its results. So if, say, the 700 document IDs (tags used during training) are in the list ref_docs, you could try...

    sims = model.docvecs.most_similar(positive=ref_docs, topn=len(model.docvecs))
    

    ...and get back a ranked list of all other in-model documents, by their similarity to the average of all those positive examples.
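
    To peek at the top of that list (mirroring the lookup in your EDIT snippet, and assuming documents were tagged with their position in all_docs and carry their tokens in a .words field):

    for tag, score in sims[:10]:
        print(round(score, 3), ' '.join(all_docs[tag].words))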

    However, the alternate interpretation, that a document's similarity to the reference-set is its highest similarity to any one document inside the set, might be better for your purpose. This could especially be the case if the reference set itself is varied over many themes – and thus not well-summarized by a single average vector.

    You'd have to compute these similarities with your own loops. For example, roughly:

    # for each document, record its highest similarity to any one of the reference docs
    sim_to_ref_set = {}
    for doc_id in all_doc_ids:
        sim_to_ref_set[doc_id] = max(model.docvecs.similarity(doc_id, ref_id) for ref_id in ref_docs)
    # rank all documents by that best-match similarity, highest first
    sims_ranked = sorted(sim_to_ref_set.items(), key=lambda it: it[1], reverse=True)
    

    The top items in sims_ranked would then be those most similar to any item in the reference set. (Assuming the reference-set ids are also in all_doc_ids, the first 700 results will be the chosen docs again, all with a self-similarity of 1.0.)
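
    If 60,000 × 700 individual similarity() calls prove too slow, one possible shortcut (a sketch only, using numpy and assuming all_doc_ids is an indexable list of training tags) is to compute all the cosine similarities as a single matrix product:

    import numpy as np

    # stack reference and candidate vectors, then L2-normalize each row
    ref_vecs = np.vstack([model.docvecs[tag] for tag in ref_docs])
    all_vecs = np.vstack([model.docvecs[tag] for tag in all_doc_ids])
    ref_vecs /= np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    all_vecs /= np.linalg.norm(all_vecs, axis=1, keepdims=True)

    # cosine similarity of every candidate to every reference doc; keep each candidate's best match
    best_sim = (all_vecs @ ref_vecs.T).max(axis=1)
    order = best_sim.argsort()[::-1]
    sims_ranked = [(all_doc_ids[i], float(best_sim[i])) for i in order]

    The full similarity matrix is only about 60,000 × 700 floats (a few hundred MB), and the ranking it produces should match the loop above.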