Tags: python, nlp, gensim, doc2vec, sentence-similarity

Gensim Doc2Vec most_similar() method not working as expected


I am struggling with Doc2Vec and I cannot see what I am doing wrong. I have a text file with sentences. I want to know, for a given sentence, what is the closest sentence we can find in that file.

Here is the code for model creation:

from gensim import models

sentences = LabeledLineSentence(filename)  # custom iterator yielding tagged sentences

model = models.Doc2Vec(size=300, min_count=1, workers=4, window=5, alpha=0.025, min_alpha=0.025)
model.build_vocab(sentences)
model.train(sentences, epochs=50, total_examples=model.corpus_count)
model.save(modelName)

For test purposes, here is my file:

uduidhud duidihdd
dsfsdf sdf sddfv
dcv dfv dfvdf g fgbfgbfdgnb
i like dogs
sgfggggggggggggggggg ggfggg

And here is my test:

test = "i love dogs".split()
print(model.docvecs.most_similar([model.infer_vector(test)]))

No matter what training parameters I use, this should obviously tell me that the most similar sentence is the fourth one (SENT_3 or SENT_4; I don't know how the indexes work, but the sentence labels have this form). But here is the result:

[('SENT_0', 0.15669342875480652),
 ('SENT_2', 0.0008485736325383186),
 ('SENT_4', -0.009077289141714573)]

What am I missing? And if I try with the same sentence ("i LIKE dogs"), I get SENT_2, then 1, then 4... I really don't get it. Why such low numbers? And when I run it a few times in a row after loading the model, I don't get the same results either.

Thanks for your help


Solution

  • Doc2Vec doesn't work well on toy-sized examples. (Published work uses tens of thousands to millions of texts, and even the tiny unit tests inside gensim use hundreds of texts, combined with a much smaller vector size and many more training epochs, to get just-barely reliable results.)

    So, I would not expect your code to have consistent or meaningful results. This is especially the case when:

    • maintaining a large vector size with tiny data (which allows severe model overfitting)
    • using a min_count=1 (because words without many varied usage examples can't get good vectors)
    • changing the min_alpha to remain the same as the larger starting alpha (because the stochastic gradient descent learning algorithm's usually-beneficial behavior relies on a gradual decay of this update-rate)
    • using documents of just a few words (as the doc-vectors are trained in proportion to the number of words they contain)

    Finally, even if everything else were working, infer_vector() usually benefits from many more steps than the default 5 (often tens or hundreds), and sometimes from a starting alpha less like its inference default (0.1) and more like the training value (0.025).

    So:

    • don't change min_count or min_alpha
    • get much more data
    • if it's not tens-of-thousands of texts, use a smaller vector size and more epochs (but realize results may still be weak with small data sets)
    • if each text is tiny, use more epochs (but realize results may still be weaker than with longer texts)
    • try other infer_vector() parameters, such as steps=50 (or more, especially with small texts), and alpha=0.025