I want to incrementally train a previously trained word2vec model: update the weights of words that were seen in the previous training, and create and train weights for new words that were not seen before. For example:
from gensim.models import Word2Vec
# old corpus
corpus = [["0", "1", "2", "3"], ["2", "3", "1"]]
# first train on old corpus
model = Word2Vec(sentences=corpus, size=2, min_count=0, window=2)
# check the embedding weights for word "1"
print(model["1"])
# here comes a new corpus with new words "4" and "5"
newCorpus = [["4", "1", "2", "3"], ["1", "5", "2"]]
# update the previous trained model
model.build_vocab(newCorpus, update=True)
model.train(newCorpus, total_examples=model.corpus_count, epochs=1)
# check if new word has embedding weights:
print(model["4"]) # yes
# check if previous word's embedding weights are updated
print(model["1"]) # output the same as before
It seems that the previous words' embeddings are not updated, even though those words' contexts have changed in the new corpus. Could someone tell me how to get the previous embedding weights to update?
Answer for original question
Try printing them out (or even just a few leading dimensions, e.g. print(model['1'][:5])) before & after to see if they've changed. Or, at the beginning, make preEmbed a proper copy of the values (e.g. preEmbed = model['1'].copy()). I think you'll see the values have really changed. Your current preEmbed variable will only be a view into the array, which changes along with the underlying array, so your later check will always return True.
Reviewing a writeup on Numpy Copies & Views will help explain what's happening with further examples.
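The copy-vs-view pitfall can be seen with a plain NumPy array standing in for the model's internal weight matrix (a minimal sketch, not gensim code):

```python
import numpy as np

# Stand-in for the model's internal vectors array
vectors = np.array([[0.1, 0.2], [0.3, 0.4]])

view = vectors[0]             # a view: tracks the underlying array
snapshot = vectors[0].copy()  # an independent copy of the values

vectors[0] += 1.0             # simulate training updating the weights

print(np.array_equal(view, vectors[0]))      # True — the view changed along with the array
print(np.array_equal(snapshot, vectors[0]))  # False — the copy preserved the old values
```

So a comparison against a view will always report "unchanged", while a comparison against a .copy() reveals the real update.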
Answer for updated code
It's likely that in your subsequent single-epoch training, all examples of '1' are being skipped via the sample downsampling feature, because '1' is a very frequent word in your tiny corpus: 28.6% of all words. (In realistic natural-language corpora, the most frequent word won't be more than a few percent of all words.) I suspect that if you disable this downsampling feature with sample=0, you'll see the changes you expect.
(Note that this feature is really helpful with adequate training data. More generally, many things about Word2Vec & related algorithms, and especially their core benefits, require lots of diverse data, and won't work well, or behave in expected ways, with toy-sized datasets.)
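For intuition, the keep-probability that this downsampling applies can be computed directly. The function below is a sketch of the formula used by the original word2vec.c and gensim (keep a word occurrence with probability (sqrt(f/s) + 1) * s/f, for word frequency fraction f and sample threshold s), not a gensim API call:

```python
import math

def keep_probability(word_fraction, sample=0.001):
    """Approximate probability that an occurrence of a word is kept
    (not downsampled), given its fraction of the corpus."""
    if sample == 0:
        return 1.0  # downsampling disabled: every occurrence trains
    return (math.sqrt(word_fraction / sample) + 1) * (sample / word_fraction)

# '1' appears 2 times among the 7 tokens of newCorpus ≈ 28.6%
print(round(keep_probability(2 / 7), 3))   # ≈ 0.063: most occurrences skipped
print(keep_probability(2 / 7, sample=0))   # 1.0: with sample=0, nothing is skipped
```

With the default sample=0.001, a word making up 28.6% of the corpus trains on only about 6% of its occurrences, which with a single epoch over a 7-token corpus can easily mean zero updates.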
Also note: your second .train() should use an explicit, accurate count for newCorpus. Re-using the cached count via total_examples=model.corpus_count may not always be appropriate when you're supplying extra data, even if it works OK here.
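For the two-sentence newCorpus in the question, the explicit count is simply len(newCorpus):

```python
newCorpus = [["4", "1", "2", "3"], ["1", "5", "2"]]

# The accurate example count for this batch of new data:
total = len(newCorpus)
print(total)  # 2

# The intended call would then be (gensim API as used in the question):
# model.train(newCorpus, total_examples=total, epochs=1)
```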
Another thing to watch out for: once you start using a model for more sophisticated operations like .most_similar(), it will have cached some calculated data for vector-to-vector comparisons, and this data won't always (at least through gensim-3.8.3) be refreshed with more training. So, you may have to discard that data (in gensim-3.8.3, by setting model.wv.vectors_norm = None) to be sure of having fresh unit-normed vectors, and fresh most_similar() (& related method) results.
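The stale-cache behavior can be sketched with a tiny stand-in class (hypothetical, for illustration only; not the real gensim KeyedVectors):

```python
import numpy as np

class KeyedVectorsSketch:
    """Caches unit-normed vectors lazily, like gensim-3.8.3's wv.vectors_norm."""
    def __init__(self, vectors):
        self.vectors = vectors
        self.vectors_norm = None  # lazily computed cache

    def get_normed(self):
        if self.vectors_norm is None:
            norms = np.linalg.norm(self.vectors, axis=1, keepdims=True)
            self.vectors_norm = self.vectors / norms
        return self.vectors_norm

kv = KeyedVectorsSketch(np.array([[3.0, 4.0]]))
print(kv.get_normed())               # unit-normed: [0.6, 0.8]

kv.vectors = np.array([[0.0, 2.0]])  # "more training" changes the raw vectors
print(kv.get_normed())               # still [0.6, 0.8] — the cache is stale

kv.vectors_norm = None               # discard the cache, as suggested above
print(kv.get_normed())               # fresh unit-normed result: [0.0, 1.0]
```

Discarding the cached array forces the next similarity lookup to recompute from the freshly trained vectors.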