python word2vec gensim doc2vec

Can I build the vocabulary twice with gensim Word2Vec or Doc2Vec?


I have two different corpora and I want to train the model on both. I thought it could be done with something like this:

model.build_vocab(sentencesCorpus1)
model.build_vocab(sentencesCorpus2)

Would it be right?


Solution

  • No: each time you call build_vocab(corpus), like that, it creates a fresh vocabulary from scratch – discarding any prior vocabulary.

    You can provide an optional argument to build_vocab(), update=True, which tries to add to the existing vocabulary. However:

    • it wasn't designed/tested with Doc2Vec in mind, and as of right now (February 2018), using it with Doc2Vec is unlikely to work and often causes memory-fault crashes. (See https://github.com/RaRe-Technologies/gensim/issues/1019.)

    • it's still best to train() with all available data together. Making multiple calls to train(), with a different data subset each time, introduces other murky tradeoffs in model quality/correctness that are easy to get wrong. (And when calling train(), be sure to provide correct values for its required parameters: the practices shown in most online examples are typically only correct when build_vocab() was called once, with exactly the same texts that are later passed to train().)
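
As a rough illustration of the "all data together" approach, the two corpora can be wrapped in one restartable iterable, so that a single build_vocab() call and the later train() call see exactly the same texts. This is a minimal sketch, not the only way to do it: CombinedCorpus and train_on_both are hypothetical names introduced here, the toy sentences stand in for the question's sentencesCorpus1 and sentencesCorpus2, and it assumes a reasonably recent gensim in which the model exposes corpus_count and epochs. Each wrapped corpus must itself be re-iterable (a list or an iterable object, not a one-shot generator).

```python
class CombinedCorpus:
    """Restartable iterable that yields the sentences of several corpora in turn."""

    def __init__(self, *corpora):
        self.corpora = corpora

    def __iter__(self):
        # Each call starts a fresh pass, so the same object can be scanned
        # once by build_vocab() and again for every training epoch.
        for corpus in self.corpora:
            yield from corpus


def train_on_both(corpus1, corpus2):
    """Build one vocabulary from both corpora, then train on the same data."""
    from gensim.models import Word2Vec  # assumption: gensim is installed

    combined = CombinedCorpus(corpus1, corpus2)
    model = Word2Vec(min_count=1)  # min_count=1 only because the toy data is tiny
    model.build_vocab(combined)    # one call that covers all texts
    model.train(
        combined,
        total_examples=model.corpus_count,  # set by the build_vocab() call above
        epochs=model.epochs,
    )
    return model


# Toy stand-ins for the question's two corpora (lists of tokenized sentences).
sentencesCorpus1 = [["hello", "world"], ["good", "morning"]]
sentencesCorpus2 = [["hello", "again"]]
combined = CombinedCorpus(sentencesCorpus1, sentencesCorpus2)
```

Calling train_on_both(sentencesCorpus1, sentencesCorpus2) would then return a model whose vocabulary covers both corpora. For small in-memory lists, sentencesCorpus1 + sentencesCorpus2 achieves the same thing; the wrapper is mainly useful when the corpora are streamed from disk.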