
How does pre-trained FastText handle multi-word queries?


Using the pre-trained model:

import fasttext
import fasttext.util

# Download the pre-trained English model; skipped if the file already exists
fasttext.util.download_model('en', if_exists='ignore')
ft = fasttext.load_model('cc.en.300.bin')

Checking ft.words, there are no entries containing spaces or underscores, but if I query the model with a multi-word string (e.g. ft["get up"]) it returns a vector without any error. What does it do? Is this correct, or would it be better to avoid these kinds of queries?


Solution

  • FastText can synthesize a guess-vector, from word-fragments, for any string.

    It can work fairly well for a typo or variant word-form of a word that was well-represented in training.

    For your 'word', 'get up', it might not work so well. The string is treated as a single out-of-vocabulary token, so its vector is built only from its character n-grams. The training set may have contained no, or no meaningful, character n-grams matching substrings of your 'word' such as 'get ', 'et u', or 't up'. But because FastText stores the n-gram vectors in a hash table that is oblivious to both collisions and presence, these lookups will still return essentially random vectors rather than an error.

    If you instead want something based on the per-word vectors for 'get' and 'up', I think you'd want to use the .get_sentence_vector() method:

    https://github.com/facebookresearch/fastText/blob/master/python/README.md#model-object
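
To make the n-gram lookup described in the answer more concrete, here is a minimal pure-Python sketch of how a string like 'get up' could be broken into character n-grams and mapped to slots in a fixed-size table. The n-gram length (5), the table size (2,000,000) and the helper names are assumptions for illustration, not values read from cc.en.300.bin. The hash is plain 32-bit FNV-1a, which I believe is close to what FastText uses for ASCII input, but the slot numbers printed here should not be taken as the library's actual indices.

```python
# Sketch of a FastText-style subword lookup for an out-of-vocabulary string.
# MINN, MAXN and BUCKET are assumed values, not read from the model file.
MINN, MAXN = 5, 5
BUCKET = 2_000_000


def char_ngrams(word, minn=MINN, maxn=MAXN):
    """Return the character n-grams of `word` after wrapping it in '<' and '>'."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for i in range(len(wrapped))
        for n in range(minn, maxn + 1)
        if i + n <= len(wrapped)
    ]


def fnv1a_32(text):
    """Plain 32-bit FNV-1a hash over the UTF-8 bytes of `text`."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


def bucket_of(ngram, bucket=BUCKET):
    """Map an n-gram to a slot; every string gets a slot, seen in training or not."""
    return fnv1a_32(ngram) % bucket


ngrams = char_ngrams("get up")
slots = {g: bucket_of(g) for g in ngrams}
print(ngrams)  # ['<get ', 'get u', 'et up', 't up>']
print(slots)
```

The point of the sketch is that the lookup can never fail: every n-gram hashes to some slot, whether or not that slot was ever trained on the same n-gram, which is why ft["get up"] returns a vector instead of raising an error.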
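
For comparison, here is a rough sketch of the per-word averaging that, as far as I understand it, .get_sentence_vector() does for unsupervised models like this one: split on whitespace, L2-normalize each word vector, then average. The two-dimensional vectors and the helper names are hypothetical stand-ins, used so the example runs without the multi-gigabyte model.

```python
import math

# Toy per-word vectors: hypothetical values standing in for ft.get_word_vector(w).
toy_vectors = {
    "get": [3.0, 4.0],
    "up": [0.0, 2.0],
}


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def sentence_vector(text, lookup):
    """Average the L2-normalized vectors of the whitespace-separated tokens."""
    dim = len(next(iter(lookup.values())))
    total = [0.0] * dim
    count = 0
    for token in text.split():
        vec = lookup[token]
        norm = l2_norm(vec)
        if norm > 0:  # zero vectors are skipped and not counted
            total = [t + x / norm for t, x in zip(total, vec)]
            count += 1
    return [t / count for t in total] if count else total


sv = sentence_vector("get up", toy_vectors)
print(sv)  # approximately [0.3, 0.9]
```

With the real model the equivalent call would be ft.get_sentence_vector("get up"). Because each word is normalized first, the result is generally not the same as a plain mean of ft["get"] and ft["up"].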