Gensim Doc2Vec infer_vector on paragraphs with unseen words generates vectors that differ based on the characters in the unseen words.
for i in range(0, 2):
    print(model.infer_vector(["zz"])[0:2])
    print(model.infer_vector(["zzz"])[0:2])
    print(model.infer_vector(["zzzz"])[0:2])
    print("\n")
[ 0.00152548 -0.00055992]
[-0.00165872 -0.00047997]
[0.00125548 0.00053445]
[ 0.00152548 -0.00055992] # same as in previous iteration
[-0.00165872 -0.00047997]
[0.00125548 0.00053445]
I am trying to understand how unseen words affect the initialization of infer_vector. It looks like different characters produce different vectors, and I'd like to understand why.
Unseen words are ignored for the actual process of iterative inference: tuning a vector to better predict a text's words, according to a frozen Doc2Vec model.
However, inference starts with a pseudorandomly-initialized vector. And the full set of tokens passed in (including unknown words) is used as the seed for that random initialization.
This seeded initialization is done as a potential small aid to those seeking fully-reproducible inference – but in practice, seeking such exact reproduction, rather than just run-to-run similarity, is usually a bad idea. See the gensim FAQ Q11 & Q12, about varying results from run to run, for more details.
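To make the seeding behavior concrete, here is a minimal sketch – not gensim's exact internals – that derives a deterministic seed from the joined tokens, in the same spirit as gensim's text-seeded initialization. Note that gensim's default hash function is Python's built-in hash(), which is salted per process unless PYTHONHASHSEED is fixed; that is one reason exact reproduction across separate runs is fragile.

import numpy as np

def seeded_init_sketch(tokens, size=100):
    # Hypothetical stand-in for gensim's seeded initialization: hash the
    # joined tokens to seed a PRNG, then draw a small random vector.
    # (Python's built-in hash() is salted per process unless
    # PYTHONHASHSEED is set, so values differ across fresh runs.)
    seed = hash(' '.join(tokens)) & 0xFFFFFFFF
    rng = np.random.default_rng(seed)
    return (rng.random(size).astype(np.float32) - 0.5) / size

print(seeded_init_sketch(["zz"])[:2])    # deterministic within a process
print(seeded_init_sketch(["zz"])[:2])    # same tokens -> identical vector
print(seeded_init_sketch(["zzz"])[:2])   # different tokens -> different vector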
So what you're seeing is just that seeded random initialization, returned without any adjustment: since none of your tokens are in the model's vocabulary, the inference loop has nothing to tune, so each distinct set of unseen tokens yields its own deterministic starting vector – which is also why the values repeat exactly across your two loop iterations.
The infer_vector() method should probably log a warning, or return a flag value (like perhaps the origin vector), as a better hint that nothing meaningful is actually happening.
But you may wish to check any text before you supply it to infer_vector() – if none of its words are in the d2v_model.wv vocabulary, then inference will simply return a small random initialization vector.
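A sketch of that check, assuming a trained model named d2v_model and a tokenized text tokens (both names are mine; the in-membership test works on the wv KeyedVectors in recent gensim versions):

# Guard against texts with no known words before inferring.
known = [word for word in tokens if word in d2v_model.wv]
if known:
    vector = d2v_model.infer_vector(tokens)
else:
    # Every word is out-of-vocabulary: infer_vector() would just hand back
    # its seeded random initialization, so skip or handle this text specially.
    vector = None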