Tags: python, nlp, gensim, doc2vec, sentence-similarity

Gensim Doc2Vec most_similar() method not working as expected


I am struggling with Doc2Vec and I cannot see what I am doing wrong. I have a text file with sentences. I want to know, for a given sentence, what is the closest sentence we can find in that file.

Here is the code for model creation:

from gensim import models

sentences = LabeledLineSentence(filename)  # custom iterator yielding tagged sentences

model = models.Doc2Vec(size=300, min_count=1, workers=4, window=5, alpha=0.025, min_alpha=0.025)
model.build_vocab(sentences)
model.train(sentences, epochs=50, total_examples=model.corpus_count)
model.save(modelName)

For test purposes, here is my file:

uduidhud duidihdd
dsfsdf sdf sddfv
dcv dfv dfvdf g fgbfgbfdgnb
i like dogs
sgfggggggggggggggggg ggfggg

And here is my test:

test = "i love dogs".split()
print(model.docvecs.most_similar([model.infer_vector(test)]))

No matter what training parameters I use, this should obviously tell me that the most similar sentence is the fourth one (SENT_3 or SENT_4; I don't know how the indexes work, but the sentence labels have this form). But here is the result:

[('SENT_0', 0.15669342875480652),
 ('SENT_2', 0.0008485736325383186),
 ('SENT_4', -0.009077289141714573)]

What am I missing? And if I try with the same sentence ("i LIKE dogs"), I get SENT_2, then 1, then 4... I really don't get it. Why such low numbers? And when I run it a few times in a row after loading the model, I don't get the same results either.

Thanks for your help


Solution

  • Doc2Vec doesn't work well on toy-sized examples. (Published work uses tens of thousands to millions of texts, and even the tiny unit tests inside gensim use hundreds of texts, combined with a much smaller vector size and many more training epochs, to get just-barely reliable results.)

    So, I would not expect your code to have consistent or meaningful results. This is especially the case when:

    • maintaining a large vector size with tiny data (which allows severe model overfitting)
    • using a min_count=1 (because words without many varied usage examples can't get good vectors)
    • changing the min_alpha to remain the same as the larger starting alpha (because the stochastic gradient descent learning algorithm's usually-beneficial behavior relies on a gradual decay of this update-rate)
    • using documents of just a few words (as the doc-vectors are trained in proportion to the number of words they contain)

    Finally, even if everything else were working, infer_vector() usually benefits from many more steps than the default 5 (often tens or hundreds), and sometimes from a starting alpha less like its inference default (0.1) and more like the training value (0.025).

    So:

    • don't change min_count or min_alpha
    • get much more data
    • if it's not tens-of-thousands of texts, use a smaller vector size and more epochs (but realize results may still be weak with small data sets)
    • if each text is tiny, use more epochs (but realize results may still be weaker than with longer texts)
    • try other infer_vector() parameters, such as steps=50 (or more, especially with small texts), and alpha=0.025