I am training on some documents with gensim's Doc2Vec.
I have two types of inputs:
Further, I want to use this model to infer sentences of size (10~20 words).
I request some clarification on my approach.
Is the method of training over documents (approx. 100 sentences each) and then inferring over a new sentence correct?
Or should I train over only sentences, not documents, and then infer over the new sentence?
Every corpus and every project's goals are different. Your approach of training on larger docs but then inferring on shorter sentences could plausibly work, but you have to try it to see how well, and then iteratively test whether shorter training docs (single sentences or groups of sentences) work better for your specific goal.
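As a minimal sketch of that first approach (training on full documents, inferring on a short sentence), assuming recent gensim and hypothetical placeholder data `long_docs` / `new_sentence`:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # hypothetical corpus: each item is one long, already-tokenized document (~100 sentences)
    long_docs = [["first", "tokenized", "document", "tokens"],
                 ["second", "tokenized", "document", "tokens"]]

    # wrap each document in a TaggedDocument with a unique tag
    train_corpus = [TaggedDocument(words=doc, tags=[str(i)])
                    for i, doc in enumerate(long_docs)]

    # train on the long documents
    model = Doc2Vec(vector_size=100, min_count=2, epochs=40)
    model.build_vocab(train_corpus)
    model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

    # infer a vector for a new short (10-20 word) sentence
    new_sentence = ["a", "short", "tokenized", "sentence", "to", "infer"]
    vector = model.infer_vector(new_sentence)

To test the alternative, you'd instead build `train_corpus` from individual sentences (or small sentence groups) and compare how useful the inferred vectors are for your downstream task.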
Note that gensim `Doc2Vec` inference often gains from non-default parameters, especially more `steps` (than the tiny default of 5) or a smaller starting `alpha` (more like the training default of 0.025), especially on shorter documents. Inference may also work better or worse depending on the original model's metaparameters.
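For example, a hedged sketch of passing those non-default inference parameters (the keyword name depends on your gensim version: older releases use `steps`, gensim 4.x uses `epochs`):

    # gensim 4.x: 'epochs' controls the number of inference passes
    vector = model.infer_vector(new_sentence, epochs=50, alpha=0.025)

    # older gensim (3.x) named the same knob 'steps':
    # vector = model.infer_vector(new_sentence, steps=50, alpha=0.025)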
Note also that an implementation limit means that texts longer than 10,000 tokens are silently truncated in gensim `Word2Vec`/`Doc2Vec` training. (If you have longer docs, you can split them into less-than-10K-token subdocuments, but then repeat the tags for each subdocument, to closely simulate what effect training with a longer document would have had.)
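A minimal sketch of that splitting, assuming a hypothetical tokenized document `long_doc_tokens` and a tag `doc_tag`:

    from gensim.models.doc2vec import TaggedDocument

    MAX_TOKENS = 10000  # gensim's per-text training limit

    def split_long_doc(long_doc_tokens, doc_tag, max_tokens=MAX_TOKENS):
        """Split an over-long token list into sub-documents that all repeat the same tag."""
        return [
            TaggedDocument(words=long_doc_tokens[i:i + max_tokens], tags=[doc_tag])
            for i in range(0, len(long_doc_tokens), max_tokens)
        ]

Because every chunk carries the same tag, the document's vector is trained on all of its tokens, roughly as if the full-length text had been used.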