python, gensim, word-embedding, fasttext

Load a quantized fastText model (.ftz) and look up words


I have a pretrained embeddings file that was quantized into .ftz format. I need to use it to look up words and find their nearest neighbours, but I cannot find any toolkit that can do both. FastText can load the embeddings file but cannot look up nearest neighbours, while Gensim can look up nearest neighbours but cannot load the model...
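
To illustrate, this is roughly what I have been trying (the file name is just a placeholder):

    import fasttext
    import gensim

    # the native package loads the quantized file, but I don't see a way to
    # get the nearest-neighbour lookup I need out of it
    model = fasttext.load_model('embeddings.ftz')

    # Gensim has most_similar, but it refuses to load the quantized model
    model = gensim.models.fasttext.load_facebook_model('embeddings.ftz')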

Or is it just me not finding the right function?

Thank you!


Solution

  • FastText models come in two flavours:

    • unsupervised models that produce word embeddings and can find similar words. The native Facebook package does not support quantization for them.
    • supervised models that are used for text classification and can be quantized natively, but generally do not produce meaningful word embeddings.
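
    For what it's worth, the native .ftz workflow applies to that supervised flavour. A minimal sketch of it, assuming a labelled training file train.txt in fastText's format (a placeholder here), looks like this; note that the result is a classifier, not a good source of word embeddings:

    import fasttext
    # train a supervised (text classification) model on labelled data
    model = fasttext.train_supervised(input='train.txt')
    # quantize it natively; retrain=True fine-tunes the model after pruning
    model.quantize(input='train.txt', retrain=True)
    model.save_model('model.ftz')  # this is the .ftz format from the question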

    To compress unsupervised models, I have created a package, compress-fasttext, which wraps Gensim and can reduce the size of unsupervised models by pruning and quantization. This post describes it in more detail.

    With this package, you can look up similar words in small models as follows:

    import compress_fasttext
    small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
        'https://github.com/avidale/compress-fasttext/releases/download/v0.0.4/cc.en.300.compressed.bin'
    )
    print(small_model.most_similar('Python'))
    # [('PHP', 0.5253), ('.NET', 0.5027), ('Java', 0.4897),  ... ]
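
    The compressed model is meant to behave like a regular Gensim keyed-vectors object, so, assuming it mirrors that interface (the method names below are the standard Gensim ones, not something I have checked against every package version), you can also pull out raw vectors and pairwise similarities:

    # vector for a single word; subword n-grams also cover out-of-vocabulary tokens
    vec = small_model['Python']
    print(vec.shape)   # e.g. (300,) for the 300-dimensional English model
    # cosine similarity between two words
    print(small_model.similarity('Python', 'Java'))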
    

    Of course, it works only if the model has been compressed using the same package. You can compress your own unsupervised model this way:

    import compress_fasttext
    from gensim.models.fasttext import load_facebook_model

    # load the original (non-quantized) fastText model and keep only its word vectors
    big_model = load_facebook_model('path-to-original-model').wv
    # prune the vocabulary and n-gram buckets by frequency and apply product quantization
    small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
    small_model.save('path-to-new-model')
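
    Once saved, the compressed model should load back from that local path with the same class as above, which gives you the nearest-neighbour lookup from the original question (the query word here is just an example):

    import compress_fasttext
    small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load('path-to-new-model')
    print(small_model.most_similar('python'))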