Tags: word2vec, doc2vec, sentence-similarity

Do Mikolov's 2014 Paragraph2Vec models assume sentence ordering?


In the Mikolov 2014 paper on paragraph vectors, https://arxiv.org/pdf/1405.4053v2.pdf, do the authors assume, in both PV-DM and PV-DBOW, that the ordering of sentences needs to make sense?

Imagine I am handling a stream of tweets, and each tweet is a paragraph. The paragraphs/tweets do not necessarily have any ordering relation to one another. After training, do the vector embeddings for the paragraphs still make sense?


Solution

  • Each document/paragraph is treated as a single unit for training – there’s no explicit way for neighboring documents to directly affect a document’s vector. So the ordering of documents doesn’t have to be natural.

    In fact, you generally don’t want all similar text examples clumped together – for example, all those on a certain topic, or using a certain vocabulary, at the front or back of the training set. That would mean those examples are all trained at a similar alpha learning rate, and nudge all their related words in one direction without interleaved, offsetting examples that use other words. Either effect can leave the model slightly less balanced/general across all possible documents. For this reason, it can be good to perform at least one initial shuffle of the text examples before training a gensim Doc2Vec (or Word2Vec) model (see the sketch after this list), if your natural ordering might not spread all topics/vocabulary words evenly through the training corpus.

    The PV-DM modes (the default dm=1 mode in gensim) do involve sliding context windows of nearby words, so word proximity within each example matters. (Don’t shuffle the words inside each text!)
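
Here’s a minimal sketch of that one-time shuffle, assuming gensim 4.x; the four-entry `tweets` list is hypothetical stand-in data, not from the question. Each tweet stays intact as one `TaggedDocument` (so PV-DM’s context windows still see real word proximity), while the order of the documents themselves is randomized:

```python
import random

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical stand-in corpus: each tweet is one independent paragraph.
tweets = [
    "the launch was scrubbed again due to weather",
    "weather delayed the rocket launch once more",
    "my sourdough starter finally doubled overnight",
    "baked a fresh loaf with the sourdough starter today",
]

# Each tweet becomes one TaggedDocument: a single training unit whose tag
# identifies its paragraph vector. Order *between* documents carries no signal.
corpus = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(tweets)]

# One initial shuffle, so similar topics aren't clumped at the front or back
# of the training stream. Word order *inside* each tweet is left untouched.
random.shuffle(corpus)

model = Doc2Vec(
    documents=corpus,
    dm=1,            # PV-DM (the default); dm=0 would select PV-DBOW instead
    vector_size=50,
    window=2,        # sliding context window over words within one document
    min_count=1,     # tiny toy corpus, so keep every word
    epochs=40,
)

# Paragraph vectors are looked up by tag; topically similar tweets should
# land nearer each other regardless of their position in the training stream.
print(model.dv.most_similar(0))  # nearest neighbors of tweet 0
```

Swapping in dm=0 trains PV-DBOW vectors with the same pipeline; in either mode, only the contents of each document matter, never the order of the documents themselves.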