python, python-3.x, nlp, gensim, doc2vec

Doc2Vec online training


I train my doc2vec model:

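from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

data = ["Sentence 1",
        "Sentence 2",
        "Sentence 3",
        "Sentence 4"]

# Each document gets its index in the list as its (string) tag
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)])
               for i, _d in enumerate(data)]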

Training part:

model = Doc2Vec(size=100, window=10, min_count=1, workers=11, alpha=0.025, 
                min_alpha=0.025, iter=20)

model.build_vocab(tagged_data, update=False)

model.train(tagged_data,epochs=model.iter,total_examples=model.corpus_count)

Save model:

model.save("d2v.model")

And it works. Then I want to add some sentences to my vocabulary and model, e.g.:

new_data = ["Sentence 5",
            "Sentence 6",
            "Sentence 7"]
# Continue the tag numbering after the original documents
new_tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i + len(data))])
                   for i, _d in enumerate(new_data)]

And then update the model:

model.build_vocab(new_tagged_data, update=True)

model.train(new_tagged_data, 
            epochs=model.iter,total_examples=model.corpus_count)

But it doesn't work. The Jupyter kernel shuts down abruptly with no error message. The same approach works with a Word2Vec model!

What could be the problem here?


Solution

  • The build_vocab(..., update=True) functionality was only developed, experimentally, in gensim for Word2Vec and hasn't been tested/debugged for Doc2Vec. There's a long-open crashing bug when trying to use it with Doc2Vec:

    https://github.com/RaRe-Technologies/gensim/issues/1019

    So, it's not yet supported.

    Separately, there are lots of murky & difficult issues related to the balance and vector-compatibility of models that are incrementally trained in this way, and if at all possible, you should re-train the model with the full old & new data, mixed together, rather than attempting small updates.