I would like to load pretrained multilingual word embeddings from the fastText library with gensim; here is the link to the embeddings:
https://fasttext.cc/docs/en/crawl-vectors.html
In particular, I would like to load the German word embeddings (cc.de.300.bin / cc.de.300.vec).
Gensim offers the following two options for loading fastText files:
gensim.models.fasttext.load_facebook_model(path, encoding='utf-8')
- Load the input-hidden weight matrix from Facebook’s native fasttext .bin output file.
- load_facebook_model() loads the full model, not just word embeddings, and enables you to continue model training.
gensim.models.fasttext.load_facebook_vectors(path, encoding='utf-8')
- Load word embeddings from a model saved in Facebook’s native fasttext .bin format.
- load_facebook_vectors() loads the word embeddings only. It's faster, but does not enable you to continue training.
Source: Gensim documentation, https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_model
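For reference, this is how I call the two loaders (a minimal sketch; I'm assuming the downloaded German model file cc.de.300.bin is in my working directory):

    from gensim.models.fasttext import load_facebook_model, load_facebook_vectors

    # Full model: word vectors plus subword n-grams; allows continued training
    model = load_facebook_model('cc.de.300.bin')

    # Vectors only: faster and lighter, but no further training possible
    wv = load_facebook_vectors('cc.de.300.bin')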
Since my laptop has only 8 GB of RAM, I keep getting MemoryErrors, or the loading takes a very long time (up to several minutes).
Is there an option to load these large models from disk in a more memory-efficient way?
As vectors will typically take at least as much addressable memory as their on-disk storage, it will be challenging to load fully functional versions of those vectors on a machine with only 8 GB of RAM. In particular:
- once you start doing the most common operation on such vectors, finding lists of the most_similar() words to a target word/vector, the gensim implementation will also want to cache a set of the word-vectors normalized to unit length, which nearly doubles the required memory
- current versions of gensim's FastText support (through at least 3.8.1) also waste a bit of memory on some unnecessary allocations (especially in the full-model case)
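To see why 8 GB is tight, a back-of-the-envelope estimate helps (a sketch assuming the usual shape of the cc.* models: roughly 2 million vocabulary words, roughly 2 million subword hash buckets, 300 dimensions, 4-byte float32 values):

    # Rough memory estimate under the assumptions above (not exact figures)
    words, buckets, dims, bytes_per_float = 2_000_000, 2_000_000, 300, 4
    word_matrix = words * dims * bytes_per_float     # ~2.4 GB of word vectors
    ngram_matrix = buckets * dims * bytes_per_float  # ~2.4 GB of subword n-gram vectors
    normed_cache = word_matrix                       # unit-length cache built for most_similar()
    print((word_matrix + ngram_matrix + normed_cache) / 1e9)  # ~7.2 (GB), before any other overhead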
If you'll only be using the vectors, not doing further training, you'll definitely want to use only the load_facebook_vectors() option.
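A minimal sketch of that route (again assuming the German cc.de.300.bin file; the probe words are just illustrative):

    from gensim.models.fasttext import load_facebook_vectors

    wv = load_facebook_vectors('cc.de.300.bin')
    # Subword n-grams are retained, so even an out-of-vocabulary word gets a vector
    vec = wv['Donaudampfschifffahrt']  # hypothetical probe word
    print(wv.most_similar('Auto', topn=5))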
If you're willing to give up the model's ability to synthesize new vectors for out-of-vocabulary words not seen during training, then you could choose to load just a subset of the full-word vectors from the plain-text .vec file. For example, to load just the first 500,000 vectors:
    from gensim.models.keyedvectors import KeyedVectors

    # Load only the first 500,000 word vectors from the plain-text .vec file
    wv = KeyedVectors.load_word2vec_format('cc.de.300.vec', limit=500000)
Because such vectors are typically sorted with the most frequently occurring words first, discarding the long tail of low-frequency words often isn't a big loss.
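As a rough check of the saving, under the same assumptions as above, 500,000 vectors × 300 dimensions × 4 bytes is about 0.6 GB, plus a similar amount once most_similar() builds its unit-length cache, which fits comfortably in 8 GB:

    # Similarity queries work as usual on the truncated set, but there is no
    # subword fallback for words outside the kept 500,000.
    print(wv.most_similar('Berlin', topn=3))  # 'Berlin' is assumed to be among the top 500K words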