Search code examples
nlpword2vecgensimdoc2vec

how to use build_vocab in gensim?


  1. Build_vocab extend my old vocabulary?

For example, my idea is when I use doc2vec(s) to train a model, it just builds the vocabulary from the datasets. If I want to extend it, I need to use build_vocab()

  1. Where should I use it? Should I put it after "gensim.doc2vec()"?

For example:

sentences = gensim.models.doc2vec.TaggedLineDocument(f_path)
dm_model = gensim.models.doc2vec.Doc2Vec(sentences, dm=1, size=300, window=8, min_count=5, workers=4)
dm_model.build_vocab()

Solution

  • You should follow working examples in gensim documentation/tutorials/notebooks or online tutorials to understand which steps are necessary and in what order.

    In particular, if you provide your sentences corpus iterable on the Doc2Vec() initialization, it will automatically do both the vocabulary-discovery pass and all training – so you don’t then need to call either build_vocab() or train() yourself. And further, you would never call build_vocab() with no arguments. (No working example in docs or online will do what your code does – so don’t improvise new things until you’ve followed the examples and know why they do what they do.)

    There is an optional update argument to build_vocab(), which purports to allow the expansion of a vocabulary from an earlier training session (in preparation for further training with the newer words). HOWEVER, it’s only been developed/tested with regard to Word2Vec models – there are reports it causes crashes when used with Doc2Vec. And even in Word2Vec, its overall effects and best-ways-to-use aren’t clear, across all training modes. So I don’t recommend its use except for experts who can read & interpret the source code, and many involved tradeoffs, on their own. If you receive a chunk of new texts, with new words, the best-grounded course of action, and easiest to evaluate/reason-about, is to re-train from scratch, using a combined corpus of all text examples.