Using the pre-trained model:
import fasttext.util
fasttext.util.download_model('en', if_exists='ignore') # English
ft = fasttext.load_model('cc.en.300.bin')
Checking ft.words, there aren't any entries containing spaces or _, but if I query the model with a multi-word string (e.g. ft["get up"]) it returns a vector without any error.
What does it do? Is this correct, or would it be better to avoid these kinds of queries?
FastText can synthesize a guess-vector, from word-fragments, for any string.
It can work fairly well for a typo or variant word-form of a word that was well-represented in training.
For your 'word', 'get up', it might not work so well. The training set may have contained no, or no meaningful, character n-grams matching substrings of your 'word' such as 'get ', 'et u', or 't up'. But because FastText stores its n-gram vectors in a hash table that ignores collisions and doesn't track which n-grams were actually seen, these lookups will still return essentially random vectors.
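To make that concrete, here is a rough pure-Python sketch of how a string gets broken into character n-grams and hashed into a fixed number of buckets. The helper names (char_ngrams, fnv1a_32, bucket_ids) are my own, and the defaults (n-gram lengths 3 to 6, 2,000,000 buckets) are assumptions that may not match your particular model. It only approximates what the library does internally, but it shows why every string, even one containing a space, maps to some set of bucket vectors:

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return all character n-grams of '<word>' with minn <= n <= maxn."""
    wrapped = f"<{word}>"  # FastText marks word boundaries with < and >
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def fnv1a_32(text):
    """32-bit FNV-1a hash over UTF-8 bytes (simplified; assumes ASCII-like input)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def bucket_ids(word, num_buckets=2_000_000, minn=3, maxn=6):
    """Map each n-gram to a bucket index; unseen n-grams still land somewhere."""
    return [fnv1a_32(g) % num_buckets for g in char_ngrams(word, minn, maxn)]


four_grams = char_ngrams("get up", minn=4, maxn=4)
print(four_grams)
print(bucket_ids("get up"))
```

Nothing in this lookup checks whether an n-gram such as 'et u' was ever seen in training, so the bucket it lands in may hold an untrained vector or one trained on unrelated n-grams.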
If you instead want something based on the per-word vectors for 'get' and 'up', I think you'd want to use the .get_sentence_vector() method:
https://github.com/facebookresearch/fastText/blob/master/python/README.md#model-object
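As far as I understand it, for an unsupervised model like this one, .get_sentence_vector() splits the text on whitespace, scales each word's vector to unit length, and averages the results. The following is a toy pure-Python sketch of that idea, not the library's actual code. The sentence_vector helper and the two-dimensional toy vectors are made up for illustration:

```python
import math


def sentence_vector(text, word_vector):
    """Average the unit-length vectors of the whitespace-separated tokens."""
    total = None
    count = 0
    for token in text.split():
        vec = word_vector(token)
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0:
            continue  # skip all-zero vectors so they don't dilute the average
        unit = [x / norm for x in vec]
        total = unit if total is None else [a + b for a, b in zip(total, unit)]
        count += 1
    if total is None:
        return []  # simplification: this sketch returns an empty list for empty input
    return [x / count for x in total]


# Hypothetical 2-d vectors standing in for the model's 300-d ones
toy_vectors = {"get": [3.0, 0.0], "up": [0.0, 4.0]}
result = sentence_vector("get up", toy_vectors.__getitem__)
print(result)
```

With the real model, you could pass ft.get_word_vector as the second argument and compare the result against ft.get_sentence_vector("get up").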