python gensim keyerror word2vec

Gensim word2vec augment or merge pre-trained vectors


I am loading pre-trained vectors from a binary file generated from the word2vec C code with something like:

model_1 = Word2Vec.load_word2vec_format('vectors.bin', binary=True)

I am using those vectors to generate vector representations of sentences, and some of the words in those sentences may not have a vector in vectors.bin. For example, if vectors.bin has no associated vector for the word "yogurt", and I try

yogurt_vector = model_1['yogurt']

I get KeyError: 'yogurt', which makes good sense. What I want is to be able to take the sentence words that do not have corresponding vectors and add representations for them to model_1. I am aware from this post that you cannot continue to train the C vectors. Is there then a way to train a new model, say model_2, for the words with no vectors and merge model_2 with model_1?

Alternatively, is there a way to test if the model contains a word before I actually try to retrieve it, so that I can at least avoid the KeyError?


Solution

  • Avoiding the key error is easy:

    [x for x in 'this model hus everything'.split() if x in model_1.vocab]
    

    Merging a new word into an existing model is the harder problem. word2vec learns each word's vector from the words that appear near it in the training corpus. If 'yogurt' was not in the corpus the first model was trained on, it never appeared next to any of those words, so a second model trained separately would produce vectors that do not correlate with the first model's vectors.

    You can look at the internals when a model is saved (it uses numpy.save), and I would be interested in working with you to come up with code that allows adding vocabulary.
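
To make the membership test above reusable for building sentence vectors, here is a minimal sketch. The helper names (known_words, sentence_vector) and the toy dictionary are hypothetical and only stand in for the loaded model; any object that supports both `word in vectors` and `vectors[word]` should work in its place, which is an assumption you would need to check against your gensim version. Averaging the known words and falling back to zeros is just one possible policy, not something gensim does for you.

```python
def known_words(sentence, vectors):
    """Return the words of `sentence` that have an entry in `vectors`."""
    return [w for w in sentence.split() if w in vectors]


def sentence_vector(sentence, vectors, dim):
    """Average the vectors of the known words in `sentence`.

    Unknown words are skipped instead of raising KeyError. If no word
    is known, a zero vector of length `dim` is returned.
    """
    words = known_words(sentence, vectors)
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for w in words:
        for i, x in enumerate(vectors[w]):
            total[i] += x
    return [t / len(words) for t in total]


# Hypothetical stand-in for the loaded model: word -> vector.
toy_vectors = {
    "this": [1.0, 0.0],
    "model": [0.0, 1.0],
    "everything": [1.0, 1.0],
}

kept = known_words("this model hus everything", toy_vectors)
avg = sentence_vector("this yogurt model", toy_vectors, dim=2)
```

Here "hus" and "yogurt" are silently dropped, so the lookup never raises. Note that this only sidesteps the KeyError; it does not give the missing words a representation in the model's vector space.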