Tags: nlp, gensim, word2vec

Reduce Google's Word2Vec model with Gensim


Loading Google's complete pre-trained word2vec model is time- and memory-intensive, so I was wondering whether there is a way to remove words below a certain frequency and bring the vocabulary down to e.g. 200k words.

I found methods in the gensim package to determine word frequency and to re-save the model, but I am not sure how to pop/remove vocabulary entries from the pre-trained model before saving it again. I couldn't find any hint of such an operation in the KeyedVectors class or the Word2Vec class:

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py

How can I select a subset of the vocabulary of the pre-trained word2vec model?


Solution

  • The GoogleNews word-vectors file format doesn't include frequency info. But, it does seem to be sorted in roughly more-frequent to less-frequent order.

    And, load_word2vec_format() offers an optional limit parameter that only reads that many vectors from the given file.

    So, the following should do roughly what you've requested:

    from gensim.models import KeyedVectors

    goognews_wordvecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=200000)
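    To avoid paying the load cost every time, you can also re-save the reduced set with save_word2vec_format() so later loads only read the 200k vectors you kept. Below is a minimal sketch of the round-trip using a tiny synthetic KeyedVectors in place of the real GoogleNews file (the file names and the toy "wordN" keys are illustrative); for the real model, substitute its path and a limit like 200000:

    ```python
    import numpy as np
    from gensim.models import KeyedVectors

    # Toy stand-in for a large pre-trained model: 10 words, 3 dimensions.
    kv = KeyedVectors(vector_size=3)
    kv.add_vectors([f"word{i}" for i in range(10)],
                   np.random.rand(10, 3).astype(np.float32))
    kv.save_word2vec_format("full_model.bin", binary=True)

    # Load only the first 5 vectors. For GoogleNews, the file is ordered
    # roughly most- to least-frequent, so this keeps the most common words.
    small = KeyedVectors.load_word2vec_format("full_model.bin",
                                              binary=True, limit=5)

    # Re-save the reduced model; future loads skip the discarded vectors.
    small.save_word2vec_format("small_model.bin", binary=True)
    ```

    The trimmed file is a regular word2vec-format file, so it loads with the same API and no limit argument needed.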