I don't understand how word vectors are involved at all in the training process with gensim's doc2vec in DBOW mode (dm=0). I know that it's disabled by default with dbow_words=0. But what happens when we set dbow_words to 1?
In my understanding of DBOW, the context words are predicted directly from the paragraph vectors. So the only parameters of the model are the N p-dimensional paragraph vectors plus the parameters of the classifier.
But multiple sources hint that it is possible in DBOW mode to co-train word and doc vectors. For instance:
So, how is this done? Any clarification would be much appreciated!
Note: for DM, the paragraph vectors are averaged/concatenated with the word vectors to predict the target words. In that case, it's clear that word vectors are trained simultaneously with document vectors, and there are N*p + M*q + classifier parameters (where M is the vocabulary size and q is the word vector space dimension).
If you set dbow_words=1, then skip-gram word-vector training is added to the training loop, interleaved with the normal PV-DBOW training.
So, for a given target word in a text, first the candidate doc-vector is used (alone) to try to predict that word, with backpropagation adjustments then occurring to the model & doc-vector. Then, a bunch of the surrounding words are each used, one at a time in skip-gram fashion, to try to predict that same target word, with the follow-up adjustments made after each prediction.
Then, the next target word in the text gets the same PV-DBOW plus skip-gram treatment, and so on, and so on.
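The interleaving described above can be sketched as a plain-Python generator that lists the prediction tasks in the order they would be attempted. This is only an illustration, not gensim's code: the function name and tuple layout are made up for this example, no gradient updates are performed, and the full window is used on both sides of the target (a real implementation may shrink the window randomly per target, so the word-step counts here are an upper bound).

```python
def dbow_training_pairs(doc_tag, words, window, dbow_words=True):
    """Yield (input_kind, input_key, target_word) prediction tasks in order.

    Hypothetical helper for illustration only; it enumerates tasks and
    performs no actual training.
    """
    for pos, target in enumerate(words):
        # PV-DBOW step: the doc-vector alone is used to predict the target word.
        yield ("doc", doc_tag, target)
        if not dbow_words:
            continue
        # Skip-gram steps: each neighbor inside the window is used, one at a
        # time, to predict that same target word.
        start = max(0, pos - window)
        end = min(len(words), pos + window + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:
                yield ("word", words[ctx_pos], target)


pairs = list(dbow_training_pairs("doc_0", ["the", "cat", "sat"], window=1))
for kind, key, target in pairs:
    print(f"{kind:>4} {key!r} -> {target!r}")
```

With dbow_words=False the generator yields only the "doc" tasks, which corresponds to plain PV-DBOW.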
As some logical consequences of this:
- training takes longer than plain PV-DBOW, by about a factor equal to the window parameter
- word-vectors overall wind up getting more total training attention than doc-vectors, again by a factor equal to the window parameter
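A rough way to see where "about a factor equal to the window" can come from is to count the skip-gram steps per target word. The sketch below rests on an assumption that the text above does not state: for each target, the effective one-sided window is drawn uniformly from 1 to window, as in the original word2vec approach. It also ignores document edges, and the function name is hypothetical.

```python
def expected_context_steps(window):
    """Average number of skip-gram steps per target word.

    Assumes (for illustration) that the effective one-sided window b is
    uniform on 1..window, giving 2*b context positions, with edges ignored.
    """
    return sum(2 * b for b in range(1, window + 1)) / window


for w in (1, 5, 10):
    # One PV-DBOW step per target word versus this many word-vector steps.
    print(w, expected_context_steps(w))
```

Under that assumption the average works out to window + 1 word-vector steps for each doc-vector step, which is in line with the rough factor quoted above.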