I am struggling with Doc2Vec and I cannot see what I am doing wrong. I have a text file with sentences. I want to know, for a given sentence, which sentence in that file is closest to it.
Here is the code for model creation:
from gensim import models

sentences = LabeledLineSentence(filename)  # yields one tagged sentence per line of the file
model = models.Doc2Vec(size=300, min_count=1, workers=4, window=5, alpha=0.025, min_alpha=0.025)
model.build_vocab(sentences)
model.train(sentences, epochs=50, total_examples=model.corpus_count)
model.save(modelName)
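(LabeledLineSentence is not a gensim built-in; it is the usual helper from older gensim tutorials. A minimal sketch, assuming it tags each line SENT_0, SENT_1, and so on, matching the labels in the results below:)
from gensim.models.doc2vec import TaggedDocument

class LabeledLineSentence(object):
    """Assumed helper: yields one TaggedDocument per line of the file."""
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename) as f:
            for n, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=['SENT_%d' % n])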
For test purposes, here is my file:
uduidhud duidihdd
dsfsdf sdf sddfv
dcv dfv dfvdf g fgbfgbfdgnb
i like dogs
sgfggggggggggggggggg ggfggg
And here is my test:
test = "i love dogs".split()
print(model.docvecs.most_similar([model.infer_vector(test)]))
No matter what training parameters I use, this should obviously tell me that the most similar sentence is the 4th one (SENT_3 or SENT_4; I don't know how the indexes work, but the sentence labels have that form). But here is the result:
[('SENT_0', 0.15669342875480652),
('SENT_2', 0.0008485736325383186),
('SENT_4', -0.009077289141714573)]
What am I missing? And if I try with the same sentence (I LIKE dogs), I get SENT_2, then 1, then 4... I really don't get it. And why such low numbers? And when I run it a few times in a row with a loaded model, I don't get the same results either.
Thanks for your help
Doc2Vec doesn't work well on toy-sized examples. (Published work uses tens-of-thousands to millions of texts, and even the tiny unit tests inside gensim use hundreds of texts, combined with a much smaller vector size and many more training epochs, to get just-barely reliable results.)
So, I would not expect your code to have consistent or meaningful results. This is especially the case when you:

- use a large size with tiny data (which allows severe model overfitting)
- use min_count=1 (because words without many varied usage examples can't get good vectors)
- set min_alpha to remain the same as the larger starting alpha (because the stochastic gradient descent learning algorithm's usually-beneficial behavior relies on a gradual decay of this update rate)

Finally, even if everything else were working, infer_vector() usually benefits from many more steps than its default of 5 (into the tens or hundreds), and sometimes from a starting alpha less like its inference default (0.1) and more like the training value (0.025).
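For example, a minimal sketch of such an inference call, assuming the pre-4.0 gensim API used in the question (where these parameters are named alpha and steps):

# more inference passes and a training-like starting alpha (gensim < 4.0)
vec = model.infer_vector("i love dogs".split(), alpha=0.025, steps=250)
print(model.docvecs.most_similar([vec]))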
So:

- don't change min_count or min_alpha from their defaults
- with a small dataset, use a smaller size and more epochs (but realize results may still be weak with small data sets)
- with shorter texts, use more epochs (but realize results may still be weaker than with longer texts)
- try other infer_vector() parameters, such as steps=50 (or more, especially with small texts), and alpha=0.025 (see the sketch below)
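Putting those together, a minimal sketch of an adjusted setup (illustrative values, not tuned; same pre-4.0 gensim API as above):

# leave min_count/min_alpha at their defaults; shrink size; train longer
model = models.Doc2Vec(size=50, window=5, workers=4, alpha=0.025)
model.build_vocab(sentences)
model.train(sentences, epochs=200, total_examples=model.corpus_count)

# infer with more steps and a training-like starting alpha
vec = model.infer_vector("i love dogs".split(), alpha=0.025, steps=50)
print(model.docvecs.most_similar([vec]))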