Process to intersect with pre-trained word vectors with gensim 4.0.0


I'm trying to learn from an example which uses an older version of gensim. In particular, I have a section of code like:

word_vectors = Word2Vec(vector_size=word_vector_dim, min_count=1)
word_vectors.build_vocab(corpus_iterable)
word_vectors.intersect_word2vec_format(pretrained_dir + 'GoogleNews-vectors-negative300.bin.gz', binary=True)

My understanding is that this fills the word vector vocabulary with pre-trained word vectors when available. When the words in my vocabulary are not in the pretrained vectors, they are initialized to random values. However, the method intersect_word2vec_format doesn't exist in the latest version of gensim. What is the cleanest way to replicate this process in gensim 4.0.0?


Solution

  • The .intersect_word2vec_format() method still exists, but because it operates on a set of word-vectors, it has moved to KeyedVectors. Older code that called the method on a Word2Vec model itself now needs to call it on the model's .wv property, which holds a KeyedVectors object. For example:

    from gensim.models import Word2Vec

    w2v_model = Word2Vec(vector_size=word_vector_dim, min_count=1)
    w2v_model.build_vocab(corpus_iterable)
    # (you'll likely need another workaround here, see below)
    w2v_model.wv.intersect_word2vec_format(pretrained_dir + 'GoogleNews-vectors-negative300.bin.gz', binary=True)
    

    However, you'll still hit some problems:

    • .intersect_word2vec_format() has always been, at best, an experimental, advanced feature, and not part of any well-documented process. It's best used only if you're able to review its source code and understand the limits and tradeoffs that come with (partially) pre-initialized word-vectors, which may be further trained or frozen depending on the vectors_lockf values chosen.
    • The equally experimental vectors_lockf functionality now, in Gensim 4+, requires manual initialization by a knowledgeable user. Because .intersect_word2vec_format() assumes a particular pre-allocation of that array, the method will break in Gensim 4.1 without an explicit workaround. See this open issue for more details.

    Most generally: pre-initializing with other word-vectors is at best a fussy, advanced technique, so be sure to study the code, consider the potential tradeoffs, and carefully evaluate its effects on your end results before embracing it. It's not an easy, automatic, or well-characterized shortcut.
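
To make the frozen-versus-trainable tradeoff concrete, here is a library-free sketch of the idea behind intersecting and lock factors. It is a simplified illustration, not Gensim's implementation: the function names intersect_vectors and apply_update, the random-initialization range, and the toy data are all hypothetical.

```python
import random


def intersect_vectors(vocab, pretrained, dim, lockf=0.0, seed=0):
    """Build vectors for `vocab`, copying from `pretrained` where possible.

    Returns (vectors, lock_factors). Copied words get lock factor `lockf`;
    words absent from `pretrained` get small random vectors and factor 1.0.
    """
    rng = random.Random(seed)
    vectors, lock_factors = {}, {}
    for word in vocab:
        if word in pretrained:
            vectors[word] = list(pretrained[word])
            lock_factors[word] = lockf
        else:
            vectors[word] = [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
            lock_factors[word] = 1.0
    return vectors, lock_factors


def apply_update(vectors, lock_factors, word, step):
    """Apply a training step scaled by the word's lock factor.

    A factor of 0.0 suppresses the update (frozen vector); 1.0 applies it fully.
    """
    scale = lock_factors[word]
    vectors[word] = [v + scale * s for v, s in zip(vectors[word], step)]


vocab = ["cat", "dog"]
pretrained = {"cat": [1.0, 2.0], "bird": [3.0, 4.0]}

vectors, locks = intersect_vectors(vocab, pretrained, dim=2, lockf=0.0)
apply_update(vectors, locks, "cat", [0.5, 0.5])  # frozen: no change
apply_update(vectors, locks, "dog", [0.5, 0.5])  # trainable: vector moves
print(vectors["cat"], locks)
```

With lockf=0.0 the imported "cat" vector never moves while "dog" keeps training, so the two can drift into inconsistent coordinate spaces. That mismatch is one reason the technique needs careful evaluation.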