Search code examples
pythongensimdoc2vec

Can doc2vec training result could change with same input data, and same parameter?


I'm using Doc2Vec in gensim library, and finding similiarity between movie, with its name as input.

model = doc2vec.Doc2Vec(vector_size=100, alpha=0.025, min_alpha=0.025, window=5)
model.build_vocab(tagged_corpus_list)
model.train(tagged_corpus_list, total_examples=model.corpus_count, epochs=50)

I set parameter like this, and didn't change preprocessing mechanism of input data, didn't changed original data.

similar_doc = model.dv.most_similar(input)

I also used this code to find most similar movie. When I restarted code to train this model, the most similar movie has changed, with changed score. Is this possible? Why? If then, how can I fix the training result?


Solution

  • Yes, this sort of change from run to run is normal. It's well-explained in question 11 of the Gensim FAQ:

    Q11: I've trained my Word2Vec / Doc2Vec / etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake? (*2vec training non-determinism)

    Answer: The *2vec models (word2vec, fasttext, doc2vec…) begin with random initialization, then most modes use additional randomization during training. (For example, the training windows are randomly truncated as an efficient way of weighting nearer words higher. The negative examples in the default negative-sampling mode are chosen randomly. And the downsampling of highly-frequent words, as controlled by the sample parameter, is driven by random choices. These behaviors were all defined in the original Word2Vec paper's algorithm description.)

    Even when all this randomness comes from a pseudorandom-number-generator that's been seeded to give a reproducible stream of random numbers (which gensim does by default), the usual case of multi-threaded training can further change the exact training-order of text examples, and thus the final model state. (Further, in Python 3.x, the hashing of strings is randomized each re-launch of the Python interpreter - changing the iteration ordering of vocabulary dicts from run to run, and thus making even the same string-of-random-number-draws pick different words in different launches.)

    So, it is to be expected that models vary from run to run, even trained on the same data. There's no single "right place" for any word-vector or doc-vector to wind up: just positions that are at progressively more-useful distances & directions from other vectors co-trained inside the same model. (In general, only vectors that were trained together in an interleaved session of contrasting uses become comparable in their coordinates.)

    Suitable training parameters should yield models that are roughly as useful, from run-to-run, as each other. Testing and evaluation processes should be tolerant of any shifts in vector positions, and of small "jitter" in the overall utility of models, that arises from the inherent algorithm randomness. (If the observed quality from run-to-run varies a lot, there may be other problems: too little data, poorly-tuned parameters, or errors/weaknesses in the evaluation method.)

    You can try to force determinism, by using workers=1 to limit training to a single thread – and, if in Python 3.x, using the PYTHONHASHSEED environment variable to disable its usual string hash randomization. But training will be much slower than with more threads. And, you'd be obscuring the inherent randomness/approximateness of the underlying algorithms, in a way that might make results more fragile and dependent on the luck of a particular setup. It's better to tolerate a little jitter, and use excessive jitter as an indicator of problems elsewhere in the data or model setup – rather than impose a superficial determinism.

    If the change between runs is small – nearest neighbors mostly the same, with a few in different positions – it's best to tolerate it.

    If the change is big, there's likely some other problem, like insufficient training data or poorly-chosen parameters.

    Notably, min_alpha=0.025 isn't a sensible value - the training is supposed to use a gradually-decreasing value, and the usual default (min_alpha=0.0001) usually doesn't need changing. (If you copied this from an online example: that's a bad example! Don't trust that site unless it explains why it's doing an odd thing.)

    Increasing the number of training epochs, from the default epochs=5 to something like 10 or 20 may also help make run-to-run results more consistent, especially if you don't have plentiful training data.