I have two different corpora, and I want to train the model on both. I thought it could be done with something like this:

model.build_vocab(sentencesCorpus1)
model.build_vocab(sentencesCorpus2)

Would that be right?
No: each time you call `build_vocab(corpus)` like that, it creates a fresh vocabulary from scratch, discarding any prior vocabulary.

You can provide an optional argument to `build_vocab()`, `update=True`, which tries to add to the existing vocabulary. However, it wasn't designed/tested with `Doc2Vec` in mind, and as of right now (February 2018), using it with `Doc2Vec` is unlikely to work and often causes memory-fault crashes. (See https://github.com/RaRe-Technologies/gensim/issues/1019.)
Moreover, it's still best to `train()` with all available data together: any sort of multiple calls to `train()`, with differing data subsets each time, introduces other murky tradeoffs in model quality/correctness that are easy to get wrong. (And, when calling `train()`, be sure to provide correct values for its required parameters; the practices shown in most online examples are typically only correct for the case where `build_vocab()` was called once, with exactly the same texts later passed to `train()`.)