Tags: python, gensim, training-data, doc2vec

Can Doc2Vec be useful if trained on documents but used to infer vectors for sentences only?


I am training gensim's Doc2Vec on some documents.

I have two types of inputs:

  1. The whole English Wikipedia: each article's text is treated as one document for Doc2Vec training (around 5.5 million articles/documents in total).
  2. Around 15,000 documents related to my project, manually prepared and collected from some websites, each about 100 sentences long.

Further, I want to use this model to infer vectors for new sentences of about 10 to 20 words.

I would like some clarification on my approach. Is it valid to train over documents (each roughly 100 sentences) and then infer vectors for new sentences?

Or should I train only over sentences, not documents, and then infer over the new sentences?
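
For concreteness, here is a rough sketch of the setup I have in mind (assuming gensim 4.x; the document texts, tags, and model parameters are only illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Toy stand-ins for the real corpora (Wikipedia articles + project documents),
# keyed by a unique tag per document.
raw_docs = {
    "wiki_12345": "Anarchism is a political philosophy and movement that questions authority",
    "proj_00001": "Our product integrates billing usage and reporting data across several services",
}

train_corpus = [
    TaggedDocument(words=simple_preprocess(text), tags=[tag])
    for tag, text in raw_docs.items()
]

# min_count=1 only because this toy corpus is tiny; real runs would use a larger value.
model = Doc2Vec(vector_size=300, window=5, min_count=1, epochs=10, workers=4)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Inference on a short sentence of 10 to 20 words.
query = "how can billing data be linked to the usage reports"
query_vector = model.infer_vector(simple_preprocess(query))
print(model.dv.most_similar([query_vector], topn=2))
```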


Solution

  • Every corpus, and every project's goals, are different. Your approach of training on larger docs but then inferring vectors for shorter sentences could plausibly work, but you have to try it to see how well, and then iteratively test whether shorter training docs (single sentences, or groups of sentences) work better for your specific goal (the first sketch after these notes shows how to build such corpora).

    Note that gensim Doc2Vec inference often benefits from non-default parameters, especially on shorter texts: more inference passes than the tiny default of 5, and/or a smaller starting alpha closer to the training default of 0.025 (see the second sketch below). Inference quality can also vary depending on the original model's metaparameters.

    Note also that an implementation limit means texts longer than 10,000 tokens are silently truncated in gensim Word2Vec/Doc2Vec training. If you have longer docs, you can split them into sub-documents of fewer than 10,000 tokens each, repeating the same tag for every sub-document, to closely approximate the effect of training on the full document (see the third sketch below).
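
To experiment with the training granularity mentioned in the first note, the same raw documents can be re-cut into sentence-level or grouped-sentence corpora. A minimal sketch, assuming gensim 4.x; `split_into_sentences` is a placeholder for whatever sentence splitter you already use:

```python
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

def sentence_level_corpus(docs, split_into_sentences):
    """One TaggedDocument per sentence, tagged by document id and sentence index."""
    corpus = []
    for doc_id, text in docs.items():
        for i, sentence in enumerate(split_into_sentences(text)):
            corpus.append(
                TaggedDocument(words=simple_preprocess(sentence),
                               tags=["%s_sent%d" % (doc_id, i)]))
    return corpus

def grouped_sentence_corpus(docs, split_into_sentences, group_size=5):
    """One TaggedDocument per block of group_size consecutive sentences."""
    corpus = []
    for doc_id, text in docs.items():
        sentences = split_into_sentences(text)
        for i in range(0, len(sentences), group_size):
            words = simple_preprocess(" ".join(sentences[i:i + group_size]))
            corpus.append(
                TaggedDocument(words=words, tags=["%s_blk%d" % (doc_id, i)]))
    return corpus
```

Training separate models on the document-level, grouped-sentence, and sentence-level corpora and evaluating each on your real task is the kind of iterative test suggested above.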
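
For the inference parameters in the second note, a sketch (the saved-model path is hypothetical; in gensim 4.x the keyword is `epochs`, while older 3.x releases called it `steps`):

```python
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

model = Doc2Vec.load("my_doc2vec.model")  # hypothetical saved model

tokens = simple_preprocess("short ten to twenty word query sentence about the project")

# More inference passes and a training-like starting alpha tend to help short texts.
vector = model.infer_vector(tokens, alpha=0.025, epochs=100)
print(model.dv.most_similar([vector], topn=10))
```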
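
And for the 10,000-token truncation in the third note, a sketch of splitting an over-long document into sub-documents that all carry the same tag (the constant and helper name are mine):

```python
from gensim.models.doc2vec import TaggedDocument

MAX_TOKENS = 10_000  # texts longer than this are silently truncated during training

def split_long_document(tokens, tag, max_tokens=MAX_TOKENS):
    """Yield sub-documents under the token limit, all sharing one tag,
    to approximate the effect of training on the full document."""
    for start in range(0, len(tokens), max_tokens):
        yield TaggedDocument(words=tokens[start:start + max_tokens], tags=[tag])

# Usage: train_corpus.extend(split_long_document(very_long_token_list, "wiki_67890"))
```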