I have noticed that my gensim Doc2Vec (DBOW) model is sensitive to document tags. My understanding was that these tags are cosmetic and so they should not influence the learned embeddings. Am I misunderstanding something? Here is a minimal example:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np
import os
os.environ['PYTHONHASHSEED'] = '0'
reps = []
for a in [0,500]:
documents = [TaggedDocument(doc, [i + a])
for i, doc in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=100, window=2, min_count=0,
workers=1, epochs=10, dm=0, seed=0)
reps.append(np.array([model.docvecs[k] for k in range(len(common_texts))])
reps[0].sum() == reps[1].sum()
This last line returns False
. I am working with gensim 3.8.3 and Python 3.5.2. More generally, is there any role that the values of the tags play (assuming they are unique)? I ask because I have found that using different tags for documents in a classification task leads to widely varying performance.
Thanks in advance.
Have you checked the magnitude of the differences?
Just running:
delta = reps[0].sum() - reps[1].sum()
for the aggregate differences results with -1.2598932e-05
when I run it.
Comparison dimension-wise:
eps = 10**-4
over = (np.abs(diff) <= eps).all()
Returns True
on a vast majority of the runs which means that you are getting quite reproducible results given the complexity of the calculations.
I would blame numerical stability of the calculations or uncontrolled randomness. Even though you do try to control the random seed, there is a different random seed in NumPy and different in random
standard library so you are not controlling for all of the sources of randomness. This can also have an influence on the results but I did not check the actual implementation in gensim
and it's dependencies.