Tags: python, nlp, word2vec, gensim, doc2vec

Hierarchical training for doc2vec: how would assigning the same labels to sentences of the same document work?


What is the effect of assigning the same label to a bunch of sentences in doc2vec? I have a collection of documents for which I want to learn vectors using gensim, for a "file" classification task, where a file refers to a collection of documents for a given ID. I have several labeling schemes in mind, and I want to know how they would differ and which is best -

  • Take a document d1, assign the label doc1 as its tag, and train. Repeat for the others

  • Take a document d1 and assign the label doc1 as its tag. Then tokenize the document into sentences, assign the label doc1 to each sentence's tags, and train on both the full document and the individual sentences. Repeat for the others

For example (ignore that the sentence isn't tokenized) -

Document -  "It is small. It is rare" 
TaggedDocument(words=["It is small. It is rare"], tags=['doc1'])
TaggedDocument(words=["It is small."], tags=['doc1'])
TaggedDocument(words=["It is rare."], tags=['doc1'])
  • Similar to the above, but also assign a unique label to each sentence along with doc1. The full document gets all of the sentence tags along with doc1.

Example -

Document -  "It is small. It is rare" 
TaggedDocument(words=["It is small. It is rare"], tags=['doc1', 'doc1_sentence1', 'doc1_sentence2'])
TaggedDocument(words=["It is small."], tags=['doc1', 'doc1_sentence1'])
TaggedDocument(words=["It is rare."], tags=['doc1', 'doc1_sentence2'])

I also have some additional categorical tags that I'd be assigning. So what would be the best approach?


Solution

  • You can do all of this! Assigning the same tag to multiple texts has almost the same effect as combining those texts into one larger text and assigning it that tag. The slight differences arise in the Doc2Vec modes that use a sliding context-window, i.e. PV-DM (dm=1): with separate texts, contexts never stretch across the end of one sentence and the beginning of the next.
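
    For instance, a toy sketch of the two presentations (gensim 4.x API; the corpus is of course far too small for meaningful training):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# The same words under one tag, as separate sentences vs. one combined text.
as_sentences = [
    TaggedDocument(words=['it', 'is', 'small'], tags=['doc1']),
    TaggedDocument(words=['it', 'is', 'rare'], tags=['doc1']),
]
as_one_text = [
    TaggedDocument(words=['it', 'is', 'small', 'it', 'is', 'rare'], tags=['doc1']),
]

# In PV-DM the context window slides within each text, so only the combined
# text can produce contexts spanning the sentence boundary (e.g. 'small' near 'it').
model = Doc2Vec(vector_size=50, dm=1, window=2, min_count=1, epochs=20)
model.build_vocab(as_sentences)
model.train(as_sentences, total_examples=model.corpus_count, epochs=model.epochs)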

    In fact, because gensim's optimized code paths have a 10,000-token limit on text sizes, splitting larger documents into sub-documents while repeating their tags is sometimes necessary as a workaround.
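
    A minimal sketch of that workaround (chunking at arbitrary token boundaries, just for illustration - a real pipeline might prefer sentence boundaries):

from gensim.models.doc2vec import TaggedDocument

def split_long_doc(words, tag, max_len=10000):
    # Yield sub-documents of at most max_len tokens, all repeating the same
    # tag, to stay under gensim's 10,000-token per-text limit.
    return [TaggedDocument(words=words[i:i + max_len], tags=[tag])
            for i in range(0, len(words), max_len)]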

    What you've specifically proposed - training on both the full documents and the document fragments - would work, but it would also double the amount of text (and thus training-attention/individual prediction-examples) for the 'doc1' tag, compared to the narrower per-sentence tags. You might want that, or not - it could affect the relative quality of each.

    What's best is unclear - it depends on your corpus and end goals, so it should be determined through experimentation, with a clear end-evaluation that lets you automate/systematize a rigorous search for what works best.
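
    As a sketch of such a harness (gensim 4.x, where per-tag vectors live in model.dv; doc_tags and labels are hypothetical parallel lists for your downstream file-classification task):

from gensim.models.doc2vec import Doc2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_scheme(tagged_corpus, doc_tags, labels):
    # Train one Doc2Vec model on a candidate tagging scheme, then score the
    # learned per-document vectors on the downstream classification task.
    model = Doc2Vec(vector_size=100, dm=1, min_count=2, epochs=20)
    model.build_vocab(tagged_corpus)
    model.train(tagged_corpus, total_examples=model.corpus_count, epochs=model.epochs)
    X = [model.dv[tag] for tag in doc_tags]
    return cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5).mean()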

    A few relevant notes, though:

    • Doc2Vec tends to work better with documents of at least a dozen or more words each.
    • The words need to be tokenized - a list-of-strings, not a single string (see the snippet after this list).
    • It benefits from a lot of varied data; in particular, if you're training a larger model – with more unique tags (including overlapping ones) and higher-dimensional vectors – you'll need more data to avoid overfitting.
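
    For example, on the tokenization point:

from gensim.models.doc2vec import TaggedDocument

# Wrong: a plain string gets iterated character-by-character during training.
bad = TaggedDocument(words="It is small.", tags=['doc1'])

# Right: a list of string tokens.
good = TaggedDocument(words=['it', 'is', 'small'], tags=['doc1'])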