What is the effect of assigning the same label to a bunch of sentences in doc2vec? I have a collection of documents that I want to learn vectors for using gensim, for a "file" classification task where a "file" refers to a collection of documents for a given ID. I have several ways of labeling in mind, and I want to know the difference between them and which is best -
1. Take a document d1, assign the label doc1 to its tags, and train. Repeat for the others.
2. Take a document d1, assign the label doc1 to its tags. Then tokenize the document into sentences, assign the label doc1 to each sentence's tags as well, and train with both the full document and the individual sentences. Repeat for the others.
For example (ignore that the sentences aren't tokenized) -
Document - "It is small. It is rare"
TaggedDocument(words=["It is small. It is rare"], tags=['doc1'])
TaggedDocument(words=["It is small."], tags=['doc1'])
TaggedDocument(words=["It is rare."], tags=['doc1'])
3. As in 2, but each sentence also gets its own unique tag, such as doc1_sentence1, in addition to doc1. The full document has all the sentence tags along with doc1. Example -
Document - "It is small. It is rare"
TaggedDocument(words=["It is small. It is rare"], tags=['doc1', 'doc1_sentence1', 'doc1_sentence2'])
TaggedDocument(words=["It is small."], tags=['doc1', 'doc1_sentence1'])
TaggedDocument(words=["It is rare."], tags=['doc1', 'doc1_sentence2'])
I also have some additional categorical tags that I'd be assigning. So what would be the best approach?
You can do all this! Assigning the same tag to multiple texts has almost the same effect as combining those texts into one larger text and assigning it that tag. The slight differences arise in Doc2Vec modes with a context window, i.e. PV-DM (dm=1): with separate texts, there'd never be contexts stretching across the end of one sentence and the beginning of the next.
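For instance, here's a minimal sketch of two short texts sharing one tag, trained in PV-DM mode (this assumes gensim 4.x, where the trained vectors live under model.dv; the tiny corpus and parameter values are only illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Two short texts sharing the tag 'doc1': during training, 'doc1' receives
# prediction-examples from both, much as if they were one longer text.
corpus = [
    TaggedDocument(words=["it", "is", "small"], tags=["doc1"]),
    TaggedDocument(words=["it", "is", "rare"], tags=["doc1"]),
]

# PV-DM (dm=1) uses a sliding context window; with separate texts that window
# never stretches across the boundary between the two sentences.
model = Doc2Vec(vector_size=50, window=2, min_count=1, dm=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

print(model.dv["doc1"])  # one vector, learned from both texts
```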
In fact, because gensim's optimized code paths have a 10,000-token limit on text sizes, splitting larger documents into sub-documents, but repeating their tags, is sometimes necessary as a workaround.
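A hedged sketch of that workaround - the chunking helper and its name are my own, not part of gensim; only the 10,000-token limit and the repeated-tag idea come from the above:

```python
from gensim.models.doc2vec import TaggedDocument

MAX_TOKENS = 10000  # gensim's optimized code paths process at most this many tokens per text

def split_into_tagged_chunks(tokens, tag, max_tokens=MAX_TOKENS):
    """Yield sub-documents of at most max_tokens tokens, all carrying the same tag."""
    for start in range(0, len(tokens), max_tokens):
        yield TaggedDocument(words=tokens[start:start + max_tokens], tags=[tag])

# Hypothetical oversized document: 25,000 tokens become three sub-documents tagged 'doc1'.
long_tokens = ["word"] * 25000
chunks = list(split_into_tagged_chunks(long_tokens, "doc1"))
print(len(chunks), [len(c.words) for c in chunks])  # 3 [10000, 10000, 5000]
```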
What you've specifically proposed - training both the full documents and the doc-fragments - would work, but would also have the effect of doubling the amount of text (and thus training-attention/individual prediction-examples) for the full-document tags like 'doc1', compared to the narrower per-sentence tags. You might want that, or not - it could affect the relative quality of each.
What's best is unclear - it depends on your corpus, and end goals, so should be determined through experimentation, with a clear end-evaluation so that you can automate/systematize a rigorous search for what's best.
A few relevant notes, though:
- Doc2Vec tends to work better with docs of at least a dozen or more words each.
- words need to be tokenized - a list-of-strings, not a string.
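For example, a small sketch of that last point, using gensim's simple_preprocess as one tokenization option (any tokenizer would do):

```python
from gensim.utils import simple_preprocess
from gensim.models.doc2vec import TaggedDocument

text = "It is small. It is rare"

# Wrong: a plain string is iterated character-by-character, so the "words"
# would be single characters like 'I', 't', ' ', ...
bad = TaggedDocument(words=text, tags=["doc1"])

# Better: a list of string tokens, here via simple_preprocess
# (which lowercases and strips punctuation).
good = TaggedDocument(words=simple_preprocess(text), tags=["doc1"])
print(good.words)  # ['it', 'is', 'small', 'it', 'is', 'rare']
```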