Loading Google's complete pre-trained word2vec model is time-intensive and tedious, so I was wondering whether there is a way to remove words below a certain frequency to bring the vocabulary count down to e.g. 200k words.
I found Word2Vec methods in the gensim package to determine the word frequency and to re-save the model, but I am not sure how to pop/remove vocab entries from the pre-trained model before saving it again. I couldn't find any hint of such an operation in the KeyedVectors class or the Word2Vec class:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py
How can I select a subset of the vocabulary of the pre-trained word2vec model?
The GoogleNews word-vectors file format doesn't include frequency info. But it does appear to be sorted in roughly most-frequent to least-frequent order.
And load_word2vec_format() offers an optional limit parameter that reads only that many vectors from the given file.
So, the following should do roughly what you've requested:
from gensim.models import KeyedVectors

goognews_wordvecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=200000)
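If you also want to persist a slimmed copy so future loads are fast, you can re-save the loaded KeyedVectors with its save_word2vec_format() method. The same idea can be sketched without gensim at all for the plain-text word2vec format ("vocab_size dim" header, one word per line): keep only the first limit lines and rewrite the header. This is a minimal illustrative sketch; the file names and the tiny synthetic vectors below are made up for the demo, and it assumes the file really is sorted most-frequent-first, as the GoogleNews file roughly is.

```python
def truncate_word2vec_text(src_path, dst_path, limit):
    """Copy the first `limit` entries of a text-format word2vec file.

    Assumes the source is sorted roughly most-frequent-first, so the
    kept prefix corresponds to the most frequent words.
    """
    with open(src_path, 'r', encoding='utf-8') as src:
        vocab_size, dim = src.readline().split()
        keep = min(int(vocab_size), limit)
        with open(dst_path, 'w', encoding='utf-8') as dst:
            # Header must reflect the new, smaller vocabulary size.
            dst.write(f"{keep} {dim}\n")
            for _ in range(keep):
                dst.write(src.readline())

# Tiny synthetic demo file: 4 words with 2-dimensional vectors.
with open('full.txt', 'w', encoding='utf-8') as f:
    f.write("4 2\n")
    for word, vec in [('the', '0.1 0.2'), ('of', '0.3 0.4'),
                      ('cat', '0.5 0.6'), ('rare', '0.7 0.8')]:
        f.write(f"{word} {vec}\n")

# Keep only the 2 most frequent entries.
truncate_word2vec_text('full.txt', 'slim.txt', 2)
```

The slimmed file can then be loaded normally (with KeyedVectors.load_word2vec_format and binary=False for the text format) without any limit parameter.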