Tags: gensim, word2vec, transfer-learning, pre-trained-model

Fine-tuning pre-trained Word2Vec model with Gensim 4.0


With Gensim < 4.0, we can retrain a word2vec model using the following code:

model = Word2Vec.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)

However, as I understand it, Gensim 4.0 no longer supports Word2Vec.load_word2vec_format. Instead, I can only load the KeyedVectors.
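
For example, as far as I can tell, the most I can do now is load the raw vectors, with no .train() available on the resulting object:

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(wv.most_similar("king"))  # similarity lookups work, but this object has no train() method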

How can I fine-tune a pre-trained word2vec model (such as the model trained on GoogleNews) with my domain-specific corpus using Gensim 4.0?


Solution

  • I don't think that code would ever have worked in Gensim versions before 4.0. A plain list of word-vectors, like GoogleNews-vectors-negative300.bin, does not have (& never has had) enough info to continue training.

    It's missing the hidden-to-output layer weights & word-frequency info essential for training.

    Looking at past source code, as of release 1.0.0 (February 2017), that code would already have given a deprecation error with a pointer to the method for loading a plain set of word-vectors – to address people with the mistaken notion that it could work – and raised other errors on any attempt to train() such a model. (Pre-1.0.0, the docs also warned that this would not work, & it would have failed with a less helpful error.)

    As one of those errors mentioned, there has at times been experimental support for loading some of a prior set of word-vectors to clobber any words in an existing model's already-initialized vocabulary, via .intersect_word2vec_format(). But by default that both (1) locks the imported vectors against further change; & (2) brings in no new words. That's unlike what people most often want from "fine-tuning", so it's not a ready-made help for that goal. (A rough sketch of that workflow appears at the end of this answer.)

    I believe some people have cobbled together custom code to achieve various kinds of fine-tuning in their projects – but I don't know of anyone who's published a reliable recipe or strong results. (And I suspect some of the people who think they're doing this well just haven't rigorously evaluated the steps they are taking.)

    If you have a recipe you know worked pre-Gensim-4.0.0, it should be adaptable – the 4.0 changes to the Word2Vec-related classes were mainly refactorings, optimizations, & new options (with little-to-no removal of functionality). But a reliable description of what used to work, or of which particular fine-tuning strategy is being pursued for what specific benefits, would be needed to make more specific recommendations.
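
    For what it's worth, the experimental .intersect_word2vec_format() workflow mentioned above might look roughly like the sketch below under Gensim 4.x. Treat it as a starting point under assumptions, not a tested recipe: the method has lived in different places across versions (on the Word2Vec model in 3.x; here assumed to be on the model's .wv KeyedVectors in 4.x), & the manual vectors_lockf sizing is an assumption about how recent releases handle the lockf setting.

    import numpy as np
    from gensim.models import Word2Vec

    my_corpus = [["domain", "specific", "words"], ["more", "domain", "text"]]  # your tokenized corpus

    # Fresh model whose vocabulary & word frequencies come from YOUR corpus
    model = Word2Vec(vector_size=300, min_count=1)  # 300 dims to match the GoogleNews vectors
    model.build_vocab(my_corpus)

    # Assumption: recent Gensim 4.x needs a full-size lockf array for imported vectors to remain trainable
    model.wv.vectors_lockf = np.ones(len(model.wv), dtype=np.float32)

    # Seed vectors for words shared with GoogleNews; lockf=1.0 leaves them free to keep training.
    # Note: no new words are added by this step.
    model.wv.intersect_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0)

    model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)

    Whether the resulting vectors are actually better for a given downstream task is exactly the kind of thing that, per the caveats above, would need rigorous evaluation.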