
Gensim Doc2Vec infer_vector on unseen words differs based on characters in these words


Gensim Doc2Vec infer_vector on paragraphs with unseen words generates vectors that differ based on the characters in the unseen words.

for i in range(0, 2):
    print(model.infer_vector(["zz"])[0:2])
    print(model.infer_vector(["zzz"])[0:2])
    print(model.infer_vector(["zzzz"])[0:2])
    print("\n")

[ 0.00152548 -0.00055992]
[-0.00165872 -0.00047997]
[0.00125548 0.00053445]


[ 0.00152548 -0.00055992] # same as in previous iteration
[-0.00165872 -0.00047997]
[0.00125548 0.00053445]

I am trying to understand how unseen words affect the initialization in infer_vector. It looks like different characters produce different vectors, and I would like to understand why.


Solution

  • Unseen words are ignored during the actual process of iterative inference: tuning a vector to better predict a text's words, according to a frozen Doc2Vec model.

    However, inference starts from a pseudorandomly initialized vector, and the full set of tokens passed in (including unknown words) is used as the seed for that random initialization.

    This seeded initialization is a small potential aid to those seeking fully reproducible inference. In practice, though, seeking such exact reproduction, rather than just run-to-run similarity, is usually a bad idea. See the Gensim FAQ, Q11 & Q12, about results varying from run to run for more details.

    So what you're seeing is:

    • each of your tokenized texts produces a vector initialization that is pseudorandom but deterministic with respect to the source text
    • since no words are known, inference afterwards is a no-op: there are no words to predict
    • the pseudorandom initialized vector is returned
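
    The three points above can be illustrated with a minimal, library-free sketch. It is not Gensim's code: the function name seeded_init_vector, the CRC32 hash, the random.Random generator and the scaling are all assumptions chosen only to show how a token-derived seed gives vectors that repeat for the same text and differ between texts.

```python
import random
import zlib


def seeded_init_vector(tokens, size=4):
    """Illustrative only: derive a small pseudorandom vector from the tokens.

    The hash (CRC32), the generator (random.Random) and the scaling are
    assumptions made for this sketch, not Gensim's implementation.
    """
    # Join all tokens, known or unknown, into a single seed string.
    seed = zlib.crc32(" ".join(tokens).encode("utf-8"))
    rng = random.Random(seed)
    # Small values centred on zero, in the spirit of a weak initial vector.
    return [(rng.random() - 0.5) / size for _ in range(size)]


# Same tokens give the same vector on every pass; different tokens differ.
for _ in range(2):
    for tokens in (["zz"], ["zzz"], ["zzzz"]):
        print(tokens, seeded_init_vector(tokens)[:2])
```

    With no known words to train against, a vector like this would be returned unchanged, which matches the repeating pattern in the question's output.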

    The infer_vector() method should probably log a warning, or return a flag value (like perhaps the origin vector), as a better hint that nothing meaningful is actually happening.

    But you may wish to check any text before you supply it to infer_vector(): if none of its words are in d2v_model.wv, inference will simply return a small random initialization vector.
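
    One way to do that check is sketched below. The helper names known_tokens and infer_or_none are hypothetical, and the sketch assumes only that the model exposes a wv attribute supporting `in` and an infer_vector() method, as a trained Doc2Vec model does.

```python
def known_tokens(tokens, vocab):
    """Return the tokens found in `vocab`.

    `vocab` is any container supporting `in`; with Gensim this would
    typically be `d2v_model.wv`.
    """
    return [t for t in tokens if t in vocab]


def infer_or_none(model, tokens):
    """Hypothetical wrapper: return None when no token is known.

    Assumes `model` has a `wv` attribute and an `infer_vector()` method.
    """
    if not known_tokens(tokens, model.wv):
        # Inference would be a no-op and only return the seeded random
        # initialization, so signal that explicitly instead.
        return None
    return model.infer_vector(tokens)


# Usage, assuming a trained model named d2v_model:
# vec = infer_or_none(d2v_model, ["zz"])
# if vec is None:
#     print("no known words; skipping inference")
```

    Returning None is only one possible flag value; raising an exception or logging a warning would serve the same purpose.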