python · machine-learning · nlp · word2vec · doc2vec

How to use pretrained word2vec vectors in doc2vec model?


I am trying to implement doc2vec, but I am not sure what the input for the model should look like if I have pretrained word2vec vectors.

The problem is that I am not sure how, in theory, pretrained word2vec vectors should be used for doc2vec. I imagine I could prefill part of the hidden layer with those vectors and fill the rest of the hidden layer with random numbers.

Another idea is to use the pretrained vector as the input for a word, instead of a one-hot encoding, but I am not sure whether the output vectors for the docs would then make sense.

Thank you for your answer!


Solution

  • You might think that Doc2Vec (aka the 'Paragraph Vector' algorithm of Mikolov/Le) requires word-vectors as a 1st step. That's a common belief, and perhaps somewhat intuitive, by analogy to how humans learn a new language: understand the smaller units before the larger, then compose the meaning of the larger from the smaller.

    But that's a common misconception, and Doc2Vec doesn't do that.

    One mode, pure PV-DBOW (dm=0 in gensim), doesn't use conventional per-word input vectors at all. And, this mode is often one of the fastest-training and best-performing options.

    The other mode, PV-DM (dm=1 in gensim, the default), does make use of neighboring word-vectors, in combination with doc-vectors, in a manner analogous to word2vec's CBOW mode – but any word-vectors it needs are trained simultaneously with the doc-vectors. They are not trained 1st in a separate step, so there's no easy splice-in point where you could provide word-vectors from elsewhere. (A minimal training sketch for both modes, with no external word-vectors anywhere in the process, appears at the end of this answer.)

    (You can mix skip-gram word-training into the PV-DBOW, with dbow_words=1 in gensim, but that will train word-vectors from scratch in an interleaved, shared-model process.)

    To the extent you could pre-seed a model with word-vectors from elsewhere, it wouldn't necessarily improve results: it could just as easily leave their quality unchanged, or make it worse. It might, in some lucky well-managed cases, speed model convergence, or be a way to enforce vector-space compatibility with an earlier vector-set, but not without extra gotchas and caveats that aren't part of the original algorithms or well-described practices. (If you still want to experiment, a rough sketch of manually seeding a model's word-vectors follows the training sketch below.)
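
    To make the above concrete, here's a minimal sketch (assuming gensim 4.x) of training Doc2Vec in each mode, with no pretrained word-vectors anywhere in the process; the tiny corpus and parameter values are stand-ins:

    ```python
    # Minimal sketch (gensim 4.x assumed): training Doc2Vec in its standard modes.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    raw_docs = [
        "the quick brown fox jumps over the lazy dog",
        "doc2vec learns fixed-length vectors for variable-length texts",
    ]
    corpus = [TaggedDocument(words=text.split(), tags=[i])
              for i, text in enumerate(raw_docs)]

    # Pure PV-DBOW: no conventional per-word input vectors are trained.
    dbow = Doc2Vec(corpus, dm=0, vector_size=100, min_count=1, epochs=20)

    # PV-DBOW with interleaved skip-gram word-training (word-vectors from scratch).
    dbow_w = Doc2Vec(corpus, dm=0, dbow_words=1, vector_size=100, min_count=1, epochs=20)

    # PV-DM (the default): word-vectors are trained simultaneously with doc-vectors.
    dm = Doc2Vec(corpus, dm=1, vector_size=100, min_count=1, epochs=20)

    print(dm.dv[0])          # doc-vector for the first document (tag 0)
    print(dm.wv['doc2vec'])  # word-vector trained alongside it
    ```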
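
    And if, despite those caveats, you want to experiment with pre-seeding, one rough, unofficial approach is to overwrite the model's randomly-initialized word-vectors after build_vocab() and before train(). The attribute names below are gensim 4.x, the file path is hypothetical, and this is a sketch of an experiment rather than a recommended practice:

    ```python
    # Rough sketch (not an established recipe): seed a Doc2Vec model's word-vectors
    # from a pretrained word2vec file before training. 'vectors.bin' is hypothetical;
    # 'corpus' is an iterable of TaggedDocument objects, as in the sketch above.
    from gensim.models import KeyedVectors
    from gensim.models.doc2vec import Doc2Vec

    pretrained = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

    model = Doc2Vec(dm=1, vector_size=pretrained.vector_size, min_count=2, epochs=20)
    model.build_vocab(corpus)

    # Overwrite the randomly-initialized rows for words the pretrained set also knows.
    # These seeded vectors are still adjusted during training, alongside the doc-vectors.
    for word in model.wv.index_to_key:
        if word in pretrained:
            model.wv.vectors[model.wv.key_to_index[word]] = pretrained[word]

    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    ```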