Tags: machine-learning, models, word2vec, gensim

Small model from Google News Word2Vec model


I am using the GoogleNews-vectors-negative300.bin model with the pycontractions library to determine, via machine learning, the best way to expand contractions with ambiguous meanings, such as I'd, which can be either I would or I had. The model is very large, around 3.5 GB.

I think 3.5 GB is far larger than I need for this purpose; I will probably never use most of the word representations in the model. Is there a way to reduce its size by extracting only the subset of word representations that are useful for my purposes?


Solution

  • Truncating the set to the first N words is easy with the optional limit argument to gensim's load_word2vec_format() method. If present, only that number of words will be loaded; for example, limit=500000 reads only the first 500,000 words from the supplied file.

    Since such files are usually sorted to put the most-frequent words first, you often don't lose much by discarding the 'long tail' of later words. (They appear less often in your texts, and their word-vectors were trained on fewer examples and are thus of lower quality anyway.)

    You could then re-save_word2vec_format() the truncated set if you want a smaller file on disk, as in the first sketch after this list.

    You could also edit the file on disk so that it includes only some other subset of words you want to retain. It is easier to do so in the text (binary=False) format. Looking at the gensim source code for load_word2vec_format()/save_word2vec_format() will help you understand what the file must look like to read back in; the second sketch after this list shows the idea.
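
A minimal sketch of the limit-then-resave approach (the file names here are assumptions; substitute your own paths):

```python
from gensim.models import KeyedVectors

# Hypothetical paths -- adjust for your own setup.
SOURCE = 'GoogleNews-vectors-negative300.bin'
TARGET = 'GoogleNews-vectors-negative300.500k.bin'

# Load only the first 500,000 vectors; the file is ordered most-frequent-first,
# so this keeps the common words and drops the long tail.
kv = KeyedVectors.load_word2vec_format(SOURCE, binary=True, limit=500000)

# Re-save the truncated set so later runs load the smaller file directly.
kv.save_word2vec_format(TARGET, binary=True)
```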
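And a rough sketch of trimming to an arbitrary subset via the text format. The keep set and file names are placeholders; in practice you would build the set from the vocabulary your own texts actually use:

```python
from gensim.models import KeyedVectors

# Placeholder vocabulary -- replace with the words that occur in your corpus.
keep = {'I', 'would', 'had', 'contraction'}

# Round-trip through the text format: a "count dims" header line followed by
# one "word v1 v2 ... v300" line per vector.
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
kv.save_word2vec_format('vectors.txt', binary=False)

kept = []
with open('vectors.txt', encoding='utf8') as src:
    dims = src.readline().split()[1]        # original header: "<count> <dims>"
    for line in src:
        if line.split(' ', 1)[0] in keep:   # first token on each line is the word
            kept.append(line)

with open('vectors_subset.txt', 'w', encoding='utf8') as dst:
    dst.write(f'{len(kept)} {dims}\n')      # header must carry the new word count
    dst.writelines(kept)

# The trimmed file loads like any other word2vec text file.
subset = KeyedVectors.load_word2vec_format('vectors_subset.txt', binary=False)
```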