I have been trying to load the GoogleNews vectors file into a gensim model. The program never finishes loading, and I keep getting a MemoryError. A few days ago I didn't have this problem, and I don't know why it's happening all of a sudden.
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('../../data/GoogleNews-vectors-negative300.bin', binary=True)
## Above is my simplified Python file and how I load the model.
Traceback (most recent call last):
File "test.py", line 4, in <module>
model = gensim.models.KeyedVectors.load_word2vec_format('../../data/GoogleNews-vectors-negative300.bin', binary=True)
File "/usr/local/lib64/python3.6/site-packages/gensim/models/keyedvectors.py", line 1549, in load_word2vec_format
limit=limit, datatype=datatype)
File "/usr/local/lib64/python3.6/site-packages/gensim/models/utils_any2vec.py", line 286, in _load_word2vec_format
vocab_size, vector_size, datatype, unicode_errors, binary_chunk_size)
File "/usr/local/lib64/python3.6/site-packages/gensim/models/utils_any2vec.py", line 205, in _word2vec_read_binary
result, counts, chunk, vocab_size, vector_size, datatype, unicode_errors)
File "/usr/local/lib64/python3.6/site-packages/gensim/models/utils_any2vec.py", line 190, in _add_bytes_to_result
_add_word_to_result(result, counts, word, vector, vocab_size)
File "/usr/local/lib64/python3.6/site-packages/gensim/models/utils_any2vec.py", line 169, in _add_word_to_result
result.vocab[word] = gensim.models.keyedvectors.Vocab(index=word_id, count=word_count)
MemoryError
You are getting a MemoryError because your system lacks enough RAM to complete the operation.
The GoogleNews vectors are over 3GB on disk, and require more RAM than that to load into the Python object heap. Even if you were doing nothing else on the same machine, it's doubtful you could do much with them on a system with 4GB of RAM; you'd need 8GB or more, depending on what else is using memory on the machine and in your Python process(es).
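As a rough sanity check on those numbers, here's the back-of-the-envelope arithmetic (this counts only the raw float32 vectors; the vocabulary dict and per-object overhead add more on top):

# 3 million words x 300 dimensions is the GoogleNews model's published shape
vocab_size = 3_000_000
vector_size = 300
bytes_per_float = 4  # float32

raw_bytes = vocab_size * vector_size * bytes_per_float
print(raw_bytes / 1024**3)  # ~3.35 GiB for the raw vectors alone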
If the same step succeeded a few days ago, then the system you were using (even if it's the same machine as now) almost certainly had more free memory when you attempted the load then than it does now.
Your main option is to load fewer of the vectors:
The Gensim KeyedVectors.load_word2vec_format() method takes an optional limit parameter, which reads only that many words from the front of the supplied file. Since the GoogleNews model includes 3 million words, something like limit=500000 loads just 1/6th of the words, and thus uses about 1/6th of the RAM.
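Applied to your own load call, that would look like (same path as in your script):

import gensim

# Load only the first 500,000 entries; word2vec-format files typically list
# the most-frequent words first, so this keeps the most useful vectors while
# cutting RAM use to roughly 1/6th of a full load.
model = gensim.models.KeyedVectors.load_word2vec_format(
    '../../data/GoogleNews-vectors-negative300.bin',
    binary=True,
    limit=500000,
)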
That's still a ton of words! And since such models typically list the most-frequently-used words first, limit=500000 discards only the less-frequently-used words. Sometimes in natural-language processing, discarding more of the rare words can even improve results on common tasks. (Rarer words' senses-of-meaning can vary more, their vectors are often lower-quality because they've been trained on fewer examples, and yet altogether they are quite numerous – sometimes making their inclusion cost more, in model size and processing time, than any incremental meaning they deliver.)
Separately, and unlikely to be a major factor in your issue: it appears you're using a years-old version of Gensim. Memory usage and task runtimes generally improve with later versions, so however you get around this particular MemoryError, you should prefer a current version of Gensim, such as 4.2.0 (as of this writing in June 2022).
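For example, after upgrading (e.g. with pip install --upgrade gensim), you can confirm which version is installed:

import gensim

# Report the installed Gensim version
print(gensim.__version__)  # e.g. '4.2.0' if current as of June 2022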