I have a pretrained embeddings file in .ftz format, i.e. it has been quantized. I need it to look up words and find their nearest neighbours, but I can't find a toolkit that can do that. FastText can load the embeddings file but cannot look up nearest neighbours, while Gensim can look up nearest neighbours but cannot load the model.
Or am I just not finding the right function?
Thank you!
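FastText models come in two flavours: unsupervised models, which produce word embeddings, and supervised models, which are used for text classification.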
To compress unsupervised models, I have created a package, compress-fasttext, which is a wrapper around Gensim that can reduce the size of unsupervised models by pruning and quantization. This post describes it in more detail.
With this package, you can look up similar words in small models as follows:
import compress_fasttext
small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'https://github.com/avidale/compress-fasttext/releases/download/v0.0.4/cc.en.300.compressed.bin'
)
print(small_model.most_similar('Python'))
# [('PHP', 0.5253), ('.NET', 0.5027), ('Java', 0.4897), ... ]
Of course, it works only if the model has been compressed using the same package. You can compress your own unsupervised model this way:
import compress_fasttext
from gensim.models.fasttext import load_facebook_model
big_model = load_facebook_model('path-to-original-model').wv
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
small_model.save('path-to-new-model')
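For intuition about what a nearest-neighbour lookup such as most_similar does, here is a minimal pure-Python sketch that ranks words by cosine similarity. It is not the package's implementation: the helper names and the tiny three-dimensional toy_vectors table are made up for illustration, and real models use hundreds of dimensions and optimized array code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Return the topn (word, similarity) pairs closest to `word`.

    `vectors` is a hypothetical dict mapping words to lists of floats;
    the query word itself is excluded from the results.
    """
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy embeddings, invented purely for this example.
toy_vectors = {
    "python": [0.9, 0.1, 0.0],
    "java": [0.8, 0.2, 0.1],
    "php": [0.7, 0.3, 0.0],
    "banana": [0.0, 0.1, 0.9],
}

neighbours = most_similar("python", toy_vectors, topn=2)
print(neighbours)
```

With these toy vectors, "java" and "php" rank above "banana" because their vectors point in almost the same direction as the one for "python".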