I am using Doc2Vec to analyze some paragraphs and want to get deterministic vector representations of the training data. Based on the official documentation, it seems that I need to set the parameters "seed" and "workers", as well as the PYTHONHASHSEED environment variable in Python 3. I therefore wrote the following script.
    import os
    from gensim.models.doc2vec import TaggedDocument
    from gensim.models import Doc2Vec

    def main():
        # Check whether the environment variable has been set successfully
        print(os.environ.get('PYTHONHASHSEED'))
        docs = [TaggedDocument(['Apple', 'round', 'apple', 'red', 'Apple', 'juicy', 'apple', 'sweet'], ['A']),
                TaggedDocument(['I', 'have', 'a', 'little', 'frog', 'His', 'name', 'is', 'Tiny', 'Tim'], ['B']),
                TaggedDocument(['On', 'top', 'of', 'spaghetti', 'all', 'covered', 'with', 'cheese'], ['C'])]
        # Loop 3 times to check whether consistent results are produced within each run
        for i in range(3):
            model = Doc2Vec(min_count=1, seed=12345, workers=1)
            model.build_vocab(docs)
            model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
            print(model.docvecs['B'])

    if __name__ == '__main__':
        os.environ['PYTHONHASHSEED'] = '12345'
        main()
The problem is that within a single run the three iterations produce identical results, but when I run the whole script again I get different results. Is there a problem with how I set the environment variable, or am I missing something else?
I am on Python 3.6.5.
I believe setting PYTHONHASHSEED inside your code is too late: it needs to be set in the OS environment, before the Python interpreter runs at all. When Python launches, it checks for this variable to decide whether string hashing, and so every dictionary keyed by strings during this execution, will use the specified randomization seed. (It isn't rechecked later, for each subsequent dictionary creation.)
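To see the effect for yourself, here is a minimal sketch that launches fresh interpreters with and without PYTHONHASHSEED in their environment and compares the hash of a string. The helper name hash_in_fresh_interpreter is my own, chosen for illustration. The same idea applies to your script: set the variable in the shell that launches Python (for example `PYTHONHASHSEED=12345 python script.py` on a Unix-like shell, where script.py stands in for your file) rather than via os.environ inside it.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(seed=None):
    """Return hash('frog') as computed by a newly launched Python process.

    Hypothetical helper for illustration: the seed, if given, is placed in
    the child's environment *before* that interpreter starts.
    """
    env = dict(os.environ)
    env.pop('PYTHONHASHSEED', None)  # start from an unseeded environment
    if seed is not None:
        env['PYTHONHASHSEED'] = str(seed)
    result = subprocess.run(
        [sys.executable, '-c', "print(hash('frog'))"],
        env=env, stdout=subprocess.PIPE, check=True,
    )
    return int(result.stdout.decode().strip())


# Seed present at interpreter startup: the hash repeats across launches.
seeded = [hash_in_fresh_interpreter(12345) for _ in range(2)]
# No seed at startup: each launch picks its own random hash seed.
unseeded = [hash_in_fresh_interpreter() for _ in range(2)]

print('seeded runs agree:  ', seeded[0] == seeded[1])
print('unseeded runs agree:', unseeded[0] == unseeded[1])
```

The seeded launches should report the same hash every time, while the unseeded ones will almost always differ from one launch to the next.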
But also, note that you generally shouldn't force determinism on these algorithms – but rather make your evaluations tolerant of small run-to-run jitter. Large jitter can be an indication of other problems with the sufficiency of your data or metaparameters – but forcing determinism hides this valuable indirect signal of model strength.
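As one possible way to make an evaluation tolerant of jitter, the sketch below compares runs by their within-run cosine similarities rather than by raw coordinates, since two runs may place the same documents in differently oriented spaces. Everything here is an assumption for illustration: the helper names, the 0.1 tolerance, and the tiny 2-d vectors are made up and are not real Doc2Vec output or a gensim API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(vectors):
    """Map each unordered pair of tags to the cosine similarity of its vectors."""
    tags = sorted(vectors)
    return {(s, t): cosine(vectors[s], vectors[t])
            for i, s in enumerate(tags) for t in tags[i + 1:]}


def runs_agree(run_a, run_b, tolerance=0.1):
    """True if every pairwise similarity differs by at most `tolerance`.

    The tolerance is an arbitrary illustrative value, not a recommendation.
    """
    sims_a = pairwise_similarities(run_a)
    sims_b = pairwise_similarities(run_b)
    return all(abs(sims_a[pair] - sims_b[pair]) <= tolerance for pair in sims_a)


# Hypothetical vectors standing in for two training runs: run_2 is roughly
# run_1 rotated by 90 degrees, with a little noise on 'B'.
run_1 = {'A': [1.0, 0.0], 'B': [0.8, 0.6], 'C': [0.0, 1.0]}
run_2 = {'A': [0.0, 1.0], 'B': [-0.62, 0.78], 'C': [-1.0, 0.0]}

print(runs_agree(run_1, run_2))
```

With real models you would fill the dictionaries from each trained model's document vectors. If the similarities swing widely between runs, that is the kind of signal about data or metaparameters mentioned above.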
There's a bit more discussion in Q11 & Q12 of the gensim project FAQ about these issues: