I train my doc2vec model:
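from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

data = ["Sentence 1",
        "Sentence 2",
        "Sentence 3",
        "Sentence 4"]
# One TaggedDocument per sentence, tagged with its index as a string
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)])
               for i, _d in enumerate(data)]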
Training part:
model = Doc2Vec(size=100, window=10, min_count=1, workers=11, alpha=0.025,
min_alpha=0.025, iter=20)
model.build_vocab(tagged_data, update=False)
model.train(tagged_data,epochs=model.iter,total_examples=model.corpus_count)
Save model:
model.save("d2v.model")
And it works. Then I want to add some sentences to my vocabulary and model. E.g.:
new_data = ["Sentence 5",
"Sentence 6",
"Sentence 7"]
# Continue the tag numbering after the original documents
new_tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i + len(data))])
                   for i, _d in enumerate(new_data)]
And then update the model:
model.build_vocab(new_tagged_data, update=True)
model.train(new_tagged_data,
epochs=model.iter,total_examples=model.corpus_count)
But it doesn't work. Jupyter shuts down abruptly with no error message. I use the same approach with a word2vec model and it works!
What could be the problem?
The build_vocab(..., update=True) functionality was only developed, experimentally, in gensim for Word2Vec and hasn't been tested/debugged for Doc2Vec. There's a long-open crashing bug when trying to use it with Doc2Vec:
https://github.com/RaRe-Technologies/gensim/issues/1019
So, it's not yet supported.
Separately, there are lots of murky & difficult issues related to the balance and vector-compatibility of models that are incrementally trained in this way, and if at all possible, you should re-train the model with the full old & new data, mixed together, rather than attempting small updates.
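As a rough sketch of the re-train-from-scratch approach, something like the following could work. The helper names tag_documents and retrain_from_scratch are made up for illustration, plain whitespace splitting stands in for word_tokenize to keep the example dependency-light, and the parameter names vector_size and epochs assume a recent gensim release (older releases use size and iter, as in the question).

```python
def tag_documents(sentences):
    """Return (words, tags) pairs, tagging each sentence with its index as a string.

    Hypothetical helper; whitespace splitting is a stand-in for word_tokenize.
    """
    return [(s.lower().split(), [str(i)]) for i, s in enumerate(sentences)]


def retrain_from_scratch(old_data, new_data, path="d2v.model"):
    """Train a fresh Doc2Vec model on the old and new sentences together.

    Hypothetical helper; hyperparameters mirror the question and are not tuned.
    """
    # Imported here so tag_documents can be used without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Tag the combined corpus in one pass so every document gets a unique tag.
    corpus = [TaggedDocument(words=w, tags=t)
              for w, t in tag_documents(old_data + new_data)]

    # Assumes recent gensim naming: vector_size/epochs rather than size/iter.
    model = Doc2Vec(vector_size=100, window=10, min_count=1, workers=4, epochs=20)
    model.build_vocab(corpus)  # full vocabulary scan, no update=True
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    model.save(path)
    return model
```

Calling retrain_from_scratch(data, new_data) would then replace the saved model with one trained on all seven sentences. Document vectors from the old and new models should not be assumed to be comparable with each other.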