I am building a Doc2Vec model with 1000 documents using Gensim. Each document consists of several sentences, and each sentence contains multiple words.
Example:
Doc1: [[word1, word2, word3], [word4, word5, word6, word7], [word8, word9, word10]]
Doc2: [[word7, word3, word1, word2], [word1, word5, word6, word10]]
To train the Doc2Vec model, I first split each document into sentences and tagged every sentence with the same document tag using TaggedDocument. As a result, I got the following training input for Doc2Vec:
TaggedDocument(words=[word1, word2, word3], tags=['Doc1'])
TaggedDocument(words=[word4, word5, word6, word7], tags=['Doc1'])
TaggedDocument(words=[word8, word9, word10], tags=['Doc1'])
TaggedDocument(words=[word7, word3, word1, word2], tags=['Doc2'])
TaggedDocument(words=[word1, word5, word6, word10], tags=['Doc2'])
However, would it be okay to train the model on each document as a whole, without splitting it into sentences?
TaggedDocument(words=[word1, word2, word3, word4, word5, word6, word7, word8, word9, word10], tags=['Doc1'])
TaggedDocument(words=[word7, word3, word1, word2, word1, word5, word6, word10], tags=['Doc2'])
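For reference, here is a minimal sketch of how I build both inputs (the docs dict below is a stand-in for my real 1000-document corpus):

from gensim.models.doc2vec import TaggedDocument

# Stand-in corpus: {document tag: list of tokenized sentences}
docs = {
    'Doc1': [['word1', 'word2', 'word3'],
             ['word4', 'word5', 'word6', 'word7'],
             ['word8', 'word9', 'word10']],
    'Doc2': [['word7', 'word3', 'word1', 'word2'],
             ['word1', 'word5', 'word6', 'word10']],
}

# Option 1: one TaggedDocument per sentence, all tagged with the parent document
per_sentence = [TaggedDocument(words=sentence, tags=[tag])
                for tag, sentences in docs.items()
                for sentence in sentences]

# Option 2: one TaggedDocument per document, with the sentences run together
per_document = [TaggedDocument(words=[w for sentence in sentences for w in sentence],
                               tags=[tag])
                for tag, sentences in docs.items()]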
Thank you in advance :)
Both approaches are going to be very similar in their effect.
The slight difference is that in PV-DM mode (dm=1), or PV-DBOW with added skip-gram training (dm=0, dbow_words=1), if you split by sentence, words in different sentences will never appear within the same context window.
For example, your 'Doc1' words 'word3' and 'word4' would never be averaged together in the same PV-DM context-window average, nor predict each other in PV-DBOW skip-gram training, if you split by sentences. If you instead run the whole doc's words together into a single TaggedDocument example, they would interact more, via appearing in shared context windows.
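For illustration, here's a minimal sketch of setting those two modes, training on the per_sentence list from your question (the parameter values are placeholders; min_count=1 is only so a tiny toy corpus trains at all):

from gensim.models.doc2vec import Doc2Vec

# PV-DM: words within `window` positions are averaged together with the doc-vector
model_dm = Doc2Vec(per_sentence, dm=1, vector_size=50, window=5, min_count=1, epochs=20)

# PV-DBOW with interleaved skip-gram word training
model_dbow = Doc2Vec(per_sentence, dm=0, dbow_words=1, vector_size=50, window=5, min_count=1, epochs=20)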
Whether one or the other is better for your purposes is something you'd have to evaluate in your own analysis - it could depend a lot on the nature of the data & desired similarity results.
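One quick spot-check (a sketch, reusing the per_sentence and per_document lists from your question) is to train a model on each variant and compare nearest-neighbor results for documents you already know to be related:

from gensim.models.doc2vec import Doc2Vec

model_split = Doc2Vec(per_sentence, dm=1, vector_size=50, min_count=1, epochs=20)
model_whole = Doc2Vec(per_document, dm=1, vector_size=50, min_count=1, epochs=20)

# .dv is the gensim 4.x name (it was .docvecs in older versions)
print(model_split.dv.most_similar('Doc1', topn=1))
print(model_whole.dv.most_similar('Doc1', topn=1))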
But, I can say that your second option, all the words in one TaggedDocument, is the more common/traditional approach.
(That is, as long as the document is still no more than 10,000 tokens long. If longer, splitting the doc's words into multiple TaggedDocument instances, each with the same tags, is a common workaround for an internal 10,000-token implementation limit.)
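A minimal sketch of that workaround (long_doc_words is a hypothetical flat token list for one over-long document):

from gensim.models.doc2vec import TaggedDocument

MAX_TOKENS = 10000  # gensim's internal per-text token limit

def chunked_tagged_docs(words, tag):
    # Split one long token list into several TaggedDocuments sharing the same tag
    return [TaggedDocument(words=words[i:i + MAX_TOKENS], tags=[tag])
            for i in range(0, len(words), MAX_TOKENS)]

# e.g., training_docs.extend(chunked_tagged_docs(long_doc_words, 'Doc1'))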