Tags: python, machine-learning, nlp, gensim, fasttext

Why is the gensim FastText model smaller in size than Facebook's native FastText model?


It seems that Gensim's implementation of FastText produces a smaller model than Facebook's native implementation. With a corpus of 1 million words, the native fasttext model is 6GB, while the gensim fasttext model is only 68MB.

Is there any information stored in Facebook's implementation not present in Gensim's implementation?


Solution

Please show which models generated this comparison, and what process produced them; a discrepancy this large most likely reflects a bug or a misunderstanding somewhere in that process.

The size of a saved model is influenced far more by the number of unique words (and character n-gram buckets) than by the size of the training corpus; a rough estimate is sketched below.
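
As a back-of-the-envelope illustration, the dominant arrays in a FastText model scale with the vocabulary and bucket counts, not with the corpus length. The numbers below are illustrative assumptions (real saves also include the output layer and vocabulary metadata):

    # Rough estimate of the dominant FastText array sizes (float32).
    # All numbers here are illustrative assumptions, not measured values.
    vocab_size = 1_000_000   # unique words in the vocabulary
    buckets = 2_000_000      # character n-gram buckets (FastText's default)
    vector_size = 100        # embedding dimensionality
    bytes_per_float = 4      # float32

    word_matrix = vocab_size * vector_size * bytes_per_float
    ngram_matrix = buckets * vector_size * bytes_per_float

    print(f"word vectors:   {word_matrix / 1024**3:.2f} GB")
    print(f"n-gram vectors: {ngram_matrix / 1024**3:.2f} GB")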

The saved sizes of a Gensim-trained FastText model and a native Facebook FastText-trained model should be roughly in the same ballpark. Be sure to include all subsidiary raw numpy files (ending in .npy, alongside the main save-file) created by Gensim's .save(), as all such files are required to re-.load() the model! A sketch of tallying them follows.
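
For example, a minimal sketch of totalling every file produced by Gensim's .save() (the tiny corpus and the filenames here are assumptions for illustration):

    import glob
    import os

    from gensim.models import FastText

    # Tiny stand-in corpus; a real model would be trained on far more text.
    sentences = [["hello", "world"], ["machine", "learning", "with", "gensim"]]
    model = FastText(sentences=sentences, vector_size=100, min_count=1, epochs=5)

    # .save() may write subsidiary .npy files alongside the main file;
    # every one of them is needed to .load() the model again.
    model.save("ft_model")
    files = glob.glob("ft_model*")
    total_mb = sum(os.path.getsize(f) for f in files) / 1024**2
    print(files, f"-> {total_mb:.1f} MB in total")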

Similarly, if you were to load a Facebook FastText model into Gensim, then use Gensim's .save(), the total disk space taken in both alternate formats should be quite close.
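
A sketch of that round-trip comparison, assuming a native binary at the placeholder path model.bin:

    import glob
    import os

    from gensim.models.fasttext import load_facebook_model

    # 'model.bin' is a placeholder path to a native Facebook FastText binary.
    fb_model = load_facebook_model("model.bin")
    fb_model.save("ft_from_facebook")

    native_mb = os.path.getsize("model.bin") / 1024**2
    gensim_mb = sum(os.path.getsize(f)
                    for f in glob.glob("ft_from_facebook*")) / 1024**2
    print(f"native: {native_mb:.1f} MB, gensim: {gensim_mb:.1f} MB")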