With Gensim < 4.0, we can retrain a word2vec model using the following code:
model = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)
However, my understanding is that Gensim 4.0 no longer supports Word2Vec.load_word2vec_format. Instead, I can only load the KeyedVectors.
How can I fine-tune a pre-trained word2vec model (such as the model trained on GoogleNews) with my domain-specific corpus using Gensim 4.0?
I don't think that code would ever have worked in Gensim versions before 4.0. A plain list of word-vectors, like GoogleNews-vectors-negative300.bin, does not have (& never has had) enough info to continue training.
It's missing the hidden-to-output layer weights & word-frequency info essential for training.
Looking at past source code, as of release 1.0.0 (February 2017), that code would've already given a deprecation error with a pointer to the method for loading a plain set of word-vectors (to address people with the mistaken notion that this could work), and raised other errors on any attempt to train() such a model. (Pre-1.0.0, the docs also warned that this would not work, & it would have failed with a less-helpful error.)
As one of those errors mentioned, there has at times been experimental support for loading some of a prior set of word-vectors to clobber any words in an existing model's already-initialized vocabulary, via .intersect_word2vec_format(). But by default that (1) locks the imported vectors against further change and (2) brings in no new words. That's unlike what people most often want from "fine-tuning", so it's not a ready-made help for that goal.
I believe some people have cobbled together custom code to achieve various kinds of fine-tuning in their projects – but I don't know of anyone who's published a reliable recipe or strong results. (And I suspect some of the people who think they're doing this well just haven't rigorously evaluated the steps they are taking.)
If you have any recipe you know worked pre-Gensim-4.0.0, it should be adaptable: the 4.0 changes to the Word2Vec-related classes were mainly refactorings, optimizations, & new options (with little-to-no removal of functionality). But I'd need a reliable description of what used to work, or of which particular fine-tuning strategy is being pursued for what specific benefits, to make more specific recommendations.