Tags: python, unit-testing, fasttext

Mocking a FastText model for unit tests


I am using fasttext models in my Python library (via the official fasttext package). To run my unit tests, I need a model (a fasttext.FastText._FastText object) that is as light as possible, so that I can version it in my repo.

I tried creating a fake text dataset "fake.txt" with 5 lines and a few words, and ran:

import fasttext
import fasttext.util

model = fasttext.train_unsupervised("./fake.txt")
fasttext.util.reduce_model(model, 2)  # shrink vectors to 2 dimensions
model.save_model("fake_model.bin")

This basically works, but the model is 16 MB. That is more or less acceptable as a unit-test resource, but do you think I can go below that?


Solution

  • Note that FastText (and similar dense word-vector models) don't perform meaningfully when trained on toy-sized data or with toy-sized parameters: all their useful, predictable, testable behaviors depend on large, varied datasets and the subtle arrangements of many final vectors.

    But if you just need a relatively meaningless object/file of the right type, your approach should work. The main parameter that makes a FastText model large regardless of a tiny training set is the bucket parameter, with a default value of 2,000,000. It allocates that many character-ngram (word-fragment) hash slots, even if your actual words don't come close to producing that many ngrams.

    Setting bucket to a far smaller value at initial model creation should make your stand-in file far smaller as well.
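    As a minimal sketch of the above (the corpus contents, the dim=2 choice, and bucket=100 are all arbitrary assumptions, not tuned values), passing a tiny bucket and a tiny dim directly to train_unsupervised even removes the need for the reduce_model step:

    ```python
    import os
    import fasttext

    # Hypothetical throwaway corpus: 5 repeated lines, a handful of words.
    with open("fake.txt", "w") as f:
        f.write("the quick brown fox jumps over the lazy dog\n" * 5)

    model = fasttext.train_unsupervised(
        "fake.txt",
        dim=2,       # tiny vector dimension, so no reduce_model step needed
        bucket=100,  # far fewer ngram hash slots than the 2,000,000 default
        minCount=1,  # keep every word despite the tiny corpus
    )
    model.save_model("fake_model.bin")

    # The saved file should now be vastly smaller than the ~16 MB default.
    print(os.path.getsize("fake_model.bin"))
    ```

    The resulting vectors are meaningless, but the object is a real fasttext.FastText._FastText instance, so it can stand in wherever your code only needs the right type and API.
    
    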