Tags: python, gensim, fasttext

How to reduce the RAM consumption of a gensim fasttext model through training parameters?


Which parameters, when training a gensim fasttext model, have the biggest effect on the resulting model's size in memory?

gojomo's answer to this question mentions ways to reduce a model's size during training, apart from reducing the embedding dimensionality.

There seem to be a few parameters that might have an effect, especially the thresholds for including words in the vocabulary. Do other parameters, such as the n-gram range, also influence model size, and which parameters have the largest effect?

I hope this is not too lazy of a question :-)


Solution

  • The main parameters affecting FastText model size are:

    • vector_size (dimensionality) - the bulk of the model is a large set of vectors (both whole-word and n-gram) of this length. Thus, reducing vector_size has a direct, large effect on total model size.
    • min_count and/or max_final_vocab - by affecting how many whole words are considered 'known' (in-vocabulary) by the model, these directly influence how many whole-word vectors are in the model. Especially if you have enough training data that model size is an issue, and are using FastText, you should consider values higher than the default min_count=5. Very rare words with just a handful of usage examples typically don't learn good, generalizable representations in word2vec-like models; good vectors come from many subtly-contrasting usage examples. But because of the Zipfian distribution of word frequencies, natural-language data typically contains a lot of such words, and they wind up taking a lot of the training time, tugging against other words' training, and pushing more-frequent words out of each other's context windows. Hence this is a case where, counter to many people's intuition, throwing away some data (the rarest words) can often improve the final model.
    • bucket - this specifies exactly how many n-gram vectors the model will learn, because they all share a collision-oblivious hashmap. That is, no matter how many unique n-grams there really are in the training data, they'll all be forced into exactly this many vectors. (Essentially, rarer n-grams will often collide with more-frequent ones and contribute just background noise.)

    Notably, because of the collisions tolerated by the bucket-sized hashmap, the parameters min_n and max_n don't affect the model size at all. Whether they allow for lots of n-grams of many sizes, or far fewer from a single/smaller range of sizes, the n-grams will be shoehorned into the same number of buckets. (If more n-grams are used, a larger bucket value may help reduce collisions, and training time will be longer. But the model only grows with a larger bucket, not with different min_n and max_n values.)
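
    As a concrete illustration, here's a minimal construction sketch showing where each of the size-relevant knobs above is passed to gensim's FastText class. The parameter values are hypothetical, chosen only to illustrate the options; tune them against your own corpus and memory budget.

    ```python
    from gensim.models import FastText

    # Hypothetical values, for illustration only.
    model = FastText(
        vector_size=100,   # dominant size factor: every whole-word & n-gram vector has this length
        min_count=10,      # ignore words with fewer than 10 usages (default is 5)
        # max_final_vocab=500_000,  # or cap the surviving vocabulary size directly
        bucket=1_000_000,  # exact number of n-gram vectors (default 2_000_000), shared via hashing
        min_n=3, max_n=6,  # n-gram range: affects training time & quality, but not model size
    )
    # The big arrays (roughly len(vocab) x vector_size and bucket x vector_size floats)
    # are only allocated once .build_vocab() is called on your tokenized corpus.
    ```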

    You can get a sense of a model's RAM size by using .save() to save it to disk - the total size of the multiple related files created (without compression) will be of roughly the same magnitude as the RAM needed by the model. So you can improve your intuition for how varying the parameters changes model size by running varied-parameter experiments with smaller models and comparing their .save() sizes. (Note that you don't actually have to .train() these models - they take up their full allocated size once the .build_vocab() step has completed.)
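
    A rough experiment along those lines might look like the following sketch. It uses gensim's tiny bundled common_texts corpus as a stand-in for real data, and the helper name saved_size_mb is just for illustration:

    ```python
    import os
    import tempfile

    from gensim.models import FastText
    from gensim.test.utils import common_texts  # tiny toy corpus; substitute your own sentences

    def saved_size_mb(model):
        """Save a model into a temp directory and sum the sizes of every file written."""
        with tempfile.TemporaryDirectory() as tmp:
            model.save(os.path.join(tmp, "ft.model"))
            return sum(os.path.getsize(os.path.join(tmp, f)) for f in os.listdir(tmp)) / 1e6

    # Vary one parameter at a time and watch how the saved size changes.
    for bucket in (50_000, 200_000, 800_000):
        m = FastText(vector_size=50, min_count=1, bucket=bucket)
        m.build_vocab(common_texts)  # full allocation happens here; no .train() needed
        print(f"bucket={bucket:>7,} -> ~{saved_size_mb(m):.1f} MB on disk")
    ```

    Because the toy corpus's vocabulary is tiny, nearly all of the reported size here comes from the bucket x vector_size n-gram array, which reinforces the point above: that array's size depends only on bucket and vector_size, not on the corpus or the min_n/max_n range.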