I'm trying to learn from an example which uses an older version of gensim. In particular, I have a section of code like:
word_vectors = Word2Vec(vector_size=word_vector_dim, min_count=1)
word_vectors.build_vocab(corpus_iterable)
word_vectors.intersect_word2vec_format(pretrained_dir + 'GoogleNews-vectors-negative300.bin.gz', binary=True)
My understanding is that this fills the word vector vocabulary with pre-trained word vectors when available. When the words in my vocabulary are not in the pretrained vectors, they are initialized to random values. However, the method intersect_word2vec_format doesn't exist in the latest version of gensim. What is the cleanest way to replicate this process in gensim 4.0.0?
The .intersect_word2vec_format() method still exists but, as an operation on a set of word-vectors, has moved to KeyedVectors. So in some cases, older code that had called the method on a Word2Vec model itself will need to call it on the model's .wv property, holding a KeyedVectors object, instead. E.g.:
w2v_model = Word2Vec(vector_size=word_vector_dim, min_count=1)
w2v_model.build_vocab(corpus_iterable)
# (you'll likely need another workaround here, see below)
w2v_model.wv.intersect_word2vec_format(pretrained_dir + 'GoogleNews-vectors-negative300.bin.gz', binary=True)
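For intuition about what that call does to the vocabulary, here is a plain-Python sketch of the "intersect" idea. It is not Gensim's code: the function name intersect_vectors, the dict-based storage, and the toy words are all made up for illustration. It only mirrors the documented behaviour: words found in both places adopt the pretrained weights, words missing from the pretrained set keep whatever vector they already had (randomly initialized here), pretrained-only words are never added, and each imported word gets a lock factor (0.0 meaning "don't update this vector in later training").

```python
import random


def intersect_vectors(vocab_vectors, lock_factors, pretrained, lockf=0.0):
    """Copy pretrained vectors into vocab_vectors for words present in both.

    Words missing from `pretrained` keep their existing vectors; words that
    appear only in `pretrained` are ignored, so the vocabulary never grows.
    Returns the number of overlapping words.
    """
    overlap = 0
    for word, vector in pretrained.items():
        if word in vocab_vectors:
            vocab_vectors[word] = list(vector)
            # One lock factor per word: 0.0 freezes the imported vector,
            # 1.0 would let later training keep updating it.
            lock_factors[word] = lockf
            overlap += 1
    return overlap


rng = random.Random(0)
dim = 3
vocab = ["cat", "dog", "zyzzyva"]

# Stand-in for build_vocab(): every known word starts with a random vector
# and a lock factor of 1.0 (fully trainable).
vocab_vectors = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
lock_factors = {w: 1.0 for w in vocab}
before_unseen = list(vocab_vectors["zyzzyva"])

# Stand-in for the pretrained file: "fish" is not in our vocabulary.
pretrained = {
    "cat": [0.1, 0.2, 0.3],
    "dog": [0.4, 0.5, 0.6],
    "fish": [0.7, 0.8, 0.9],
}

overlap = intersect_vectors(vocab_vectors, lock_factors, pretrained)
```

Note that the lock factors need one slot per vocabulary word, which is the pre-allocation the real method relies on.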
However, you'll still hit some problems:
vectors_lockf values chosen).
The vectors_lockf functionality will now, in Gensim 4+, require manual initialization by the knowledgeable - & because .intersect_word2vec_format() assumes a particular pre-allocation, that method will break in Gensim 4.1 without an explicit workaround. See this open issue for more details.
Most generally: pre-initializing with other word-vectors is at best a fussy, advanced technique, so be sure to study the code, consider the potential tradeoffs, & carefully evaluate its effects on your end-results, before embracing it. It's not an easy, automatic, or well-characterized shortcut.