I am using Gensim's Doc2Vec to train a model, and I use infer_vector() to infer a vector for a new document and compare it against the model's document vectors. However, reusing the same document can produce very different results, so there is no way to evaluate document similarity accurately.
Searching online, I found that infer_vector() is randomized, so each call on the same text produces a different vector.
Is there any way to solve this problem?
import pickle

model_dm = pickle.load(model_pickle)           # model_pickle: open file handle to a pickled model
inferred_vector_dm = model_dm.infer_vector(i)  # i: the new document
simsinput = model_dm.docvecs.most_similar([inferred_vector_dm], topn=10)
If you supply an optional epochs argument to infer_vector() that's larger than the default, the resulting vectors, from run to run on a single text, should become more similar. (This will likely be especially helpful on small texts.)
That is, there should only be a small "jitter" between runs, and that shouldn't make a big difference in your later comparisons. (Your downstream comparisons should be tolerant of small changes.) With an algorithm like this, that uses randomization, there's no absolutely "right" result, just useful results.
If the variance between runs remains large – for example, changing the most_similar() results significantly from run to run – then there might be other problems with your model or setup:
Doc2Vec doesn't work well on toy-sized training sets – published work uses corpora of tens of thousands to millions of documents, each dozens to thousands of words long. If you're using just a handful of short sentences, you won't get good results.
infer_vector() needs to get a list of string tokens, not a string. And those tokens should have been preprocessed in the same way as the training data. Any unknown words fed to infer_vector() will be ignored, making the effective input shorter (or zero-length) and the results more (or totally) random.
Separately, gensim's Doc2Vec has native .save() and .load() methods which should be used rather than raw pickle – especially on larger models, they'll do things more efficiently or without errors. (Though note: they may create multiple save files, which should be kept together so that loading the main file can find the subsidiary files.)