What parameters when training a gensim fasttext model have the biggest effect on the resulting model's size in memory?
gojomo's answer to this question mentions ways to reduce a model's size during training, apart from reducing embedding dimensionality.
There seem to be a few parameters that might have an effect: especially the thresholds for including words in the vocabulary. Do the other parameters also influence model size, for example the n-gram range, and which parameters have the largest effect?
I hope this is not too lazy of a question :-)
The main parameters affecting FastText model size are:
- `vector_size` (dimensionality): the size of the model is overwhelmingly a series of vectors (both whole-word and n-gram) of this length. Thus, reducing `vector_size` has a direct, large effect on total model size (see the rough size estimate sketched after this list).

- `min_count` and/or `max_final_vocab`: by affecting how many whole words are considered 'known' (in-vocabulary) for the model, these directly influence how many bulk vectors are in the model. Especially if you have training data large enough that model size is an issue – and are using `FastText` – you should be considering values higher than the default `min_count=5`. Very rare words, with just a handful of usage examples, typically don't learn good generalizable representations in word2vec-like models. (Good vectors come from many subtly-contrasting usage examples.) But because of the Zipfian distribution of natural-language data, there are typically a lot of such words, so they do wind up taking a lot of the training time, tug against other words' training, and push more-frequent words out of each other's context windows. Hence this is a case where, counter to many people's intuition, throwing away some data (the rarest words) can often improve the final model.

- `bucket`: this specifies exactly how many n-gram vectors will be learned by the model, because they all share a collision-oblivious hashmap. That is, no matter how many unique n-grams there really are in the training data, they'll all be forced into exactly this many vectors. (Essentially, rarer n-grams will often collide with more-frequent ones, and be just background noise.)

Notably, because of the collisions tolerated by the `bucket`-sized hashmap, the parameters `min_n` and `max_n` don't affect the model size at all. Whether they allow for lots of n-grams of many sizes, or far fewer of a single/smaller range of sizes, they'll be shoehorned into the same number of buckets. (If more n-grams are used, a larger `bucket` value may help reduce collisions, and with more n-grams, training time will be longer. But the model will only grow with a larger `bucket`, not with different `min_n` and `max_n` values.)
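A back-of-the-envelope calculation can make these relative effects concrete. The sketch below is an approximation, not gensim's exact accounting: it assumes float32 vectors and three comparably-shaped arrays (whole-word input vectors, the `bucket`-sized n-gram table, and a negative-sampling output layer); the true overhead varies by gensim version and training mode.

```python
def rough_fasttext_ram(vocab_size, vector_size=100, bucket=2_000_000):
    """Very rough RAM estimate (bytes) for a gensim FastText model.

    Assumes float32 (4 bytes/value) and three arrays: whole-word input
    vectors, the bucket-sized n-gram table, and the negative-sampling
    output layer. (An assumption; exact overhead varies by version.)
    """
    bytes_per_float = 4
    whole_words = vocab_size * vector_size * bytes_per_float
    ngrams = bucket * vector_size * bytes_per_float
    output_layer = vocab_size * vector_size * bytes_per_float
    return whole_words + ngrams + output_layer

# With gensim defaults (vector_size=100, bucket=2_000_000), the n-gram
# table alone is ~800 MB and dwarfs a modest 50k-word vocabulary:
print(rough_fasttext_ram(50_000))                   # ~840 MB
print(rough_fasttext_ram(50_000, vector_size=50))   # ~420 MB: halved
print(rough_fasttext_ram(50_000, bucket=100_000))   # ~80 MB
```

Note how `vector_size` scales everything linearly, while `bucket` usually dominates the absolute total – which is why `min_n` and `max_n` drop out of the size question entirely.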
You can get a sense of a model's RAM size by using `.save()` to save it to disk: the combined size of the multiple related files created (without compression) will be of roughly similar magnitude to the RAM needed by the model. So you can improve your intuition for how varying parameters changes the model size by running varied-parameter experiments with smaller models and comparing their different `.save()` sizes, as sketched below. (Note that you don't actually have to `.train()` these models – they'll take up their full allocated size once the `.build_vocab()` step has completed.)
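A minimal sketch of that experiment, assuming gensim 4.x; the toy corpus and the parameter grid are placeholders to swap for your own:

```python
import os
from gensim.models import FastText

# Placeholder corpus: any iterable of token lists works here.
sentences = [["hello", "world"], ["another", "example", "sentence"]] * 1000

# Hypothetical parameter grid -- adjust to whatever you want to compare.
for vector_size, bucket in [(100, 2_000_000), (100, 100_000), (50, 100_000)]:
    model = FastText(vector_size=vector_size, bucket=bucket, min_count=5)
    model.build_vocab(sentences)  # arrays fully allocated here; no .train() needed
    path = f"ft_{vector_size}d_{bucket}b.model"
    model.save(path)
    # .save() may split large arrays into sidecar .npy files sharing the prefix
    total = sum(os.path.getsize(f) for f in os.listdir(".") if f.startswith(path))
    print(f"{path}: ~{total / 1e6:.0f} MB on disk")
```

Running this, you should see the on-disk totals track the back-of-the-envelope estimates above: the `bucket=2_000_000` run is far larger than the others even though the vocabulary is identical.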