
What is the difference between doc2vec models when dbow_words is set to 1 or 0?


I read this page, but I do not understand the difference between the models built with the following code. I know that when dbow_words is 0, training of the doc-vectors is faster.

First model

from gensim.models import doc2vec

model = doc2vec.Doc2Vec(documents1, size=100, window=300, min_count=10, workers=4)

Second model

model = doc2vec.Doc2Vec(documents1, size=100, window=300, min_count=10, workers=4, dbow_words=1)

Solution

  • The dbow_words parameter only has an effect when training a DBOW model – that is, with the non-default dm=0 parameter.

    So, between your two example lines of code, which both leave the default dm=1 value unchanged, there's no difference.

    If you instead switch to DBOW training with dm=0, then with the default dbow_words=0 setting the model is pure PV-DBOW, as described in the original 'Paragraph Vectors' paper. Doc-vectors are trained to be predictive of the words in each text example, but no word-vectors are trained. (There will still be some randomly-initialized word-vectors in the model, but they are not used or improved during training.) This mode is fast and still works pretty well.

    If you add the dbow_words=1 setting, then skip-gram word-vector training is added to the training in an interleaved fashion. (For each text example, doc-vectors are trained against the whole text, and then word-vectors are trained over each sliding context window.) Since this adds more training examples, as a function of the window parameter, it will be significantly slower. (For example, with window=5, adding word-training will make training about 5x slower.)

    This has the benefit of placing both the DBOW doc-vectors and the word-vectors into the "same space" - perhaps making the doc-vectors more interpretable by their closeness to words (a code sketch of both configurations appears at the end of this answer).

    This mixed training might serve as a sort of corpus-expansion – turning each context-window into a mini-document – that helps improve the expressiveness of the resulting doc-vector embeddings. (Though, especially with sufficiently large and diverse document sets, it may be worth comparing against pure-DBOW with more passes.)
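
    For concreteness, here is a minimal sketch (not part of the original answer) of the two DBOW configurations discussed above. The toy corpus, tags, and hyperparameter values are placeholders chosen only for illustration, and the parameter names follow the current gensim 4.x API (vector_size, model.dv) rather than the older size parameter used in the question.

        # Minimal sketch comparing dm=0 with dbow_words=0 vs dbow_words=1
        # (toy corpus and hyperparameter values are illustrative only).
        from gensim.models.doc2vec import Doc2Vec, TaggedDocument

        # Tiny toy corpus; in practice documents1 would be a much larger
        # iterable of TaggedDocument objects.
        raw_texts = [
            "the quick brown fox jumps over the lazy dog",
            "a fast auburn fox leaps above a sleepy hound",
            "stock markets rallied after the earnings report",
        ]
        documents1 = [TaggedDocument(words=text.split(), tags=[i])
                      for i, text in enumerate(raw_texts)]

        # Pure PV-DBOW: only doc-vectors are trained; the word-vectors stay at
        # their random initialization and should not be consulted.
        pure_dbow = Doc2Vec(documents1, dm=0, dbow_words=0,
                            vector_size=100, window=5, min_count=1,
                            workers=4, epochs=40)

        # DBOW plus interleaved skip-gram word training: doc-vectors and
        # word-vectors end up in the same space, at roughly window-times the
        # training cost.
        dbow_plus_words = Doc2Vec(documents1, dm=0, dbow_words=1,
                                  vector_size=100, window=5, min_count=1,
                                  workers=4, epochs=40)

        # With dbow_words=1 the word-vectors are meaningful, so a doc-vector's
        # nearest words can help interpret it.
        doc_vec = dbow_plus_words.dv[0]
        print(dbow_plus_words.wv.similar_by_vector(doc_vec, topn=5))

    The same similar_by_vector lookup against pure_dbow.wv would return essentially arbitrary neighbors, because in pure PV-DBOW those word-vectors are never updated during training.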