I am working on an NLP project using FastText. I have some texts that contain words like @.poisonjamak, @aminagabread, @iamquak123, and I want to see their FastText representations. Note that the model has the following form:
# FastText
ft_model = FastText(word_tokenized_corpus,
                    max_n=0,
                    vector_size=64,
                    window=5,
                    min_count=1,
                    sg=1,
                    workers=20,
                    epochs=50,
                    seed=42)
Using this I try to see their representations, but I get an error:
print(ft_model.wv['@.poisonjamak'])
KeyError: 'cannot calculate vector for OOV word without ngrams'
Of course, these words are in my texts. I get the above error for all three words; however, if I do the following, it works:
print(ft_model.wv['@.poisonjamak']) -----> print(ft_model.wv['poisonjamak'])
print(ft_model.wv['@aminagabread']) -----> print(ft_model.wv['aminagabread'])
print(ft_model.wv['@_iamquak123_']) -----> print(ft_model.wv['_iamquak123_'])
Question: So do you know why I have this problem?
Update: My dataframe is called 'df' and the column with the texts is called 'text'. I am using the following code to prepare the texts for FastText; the model is trained on word_tokenized_corpus.
import nltk

extra_list = df.text.tolist()
# drop empty / whitespace-only rows
final_corpus = [sentence for sentence in extra_list if sentence.strip() != '']
word_punctuation_tokenizer = nltk.WordPunctTokenizer()
word_tokenized_corpus = [word_punctuation_tokenizer.tokenize(sent) for sent in final_corpus]
As comments note, the main issue is likely with your tokenizer, which won't put '@' characters inside your tokens. As a result, your FastText model isn't seeing the tokens you expect, but it probably does have a word-vector for the 'word' '@'.
Separately reviewing your actual word_tokenized_corpus, to see what it truly includes before the model gets to do its training, is a good way to confirm this (or catch this class of error in the future).
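For example, a quick check like the following (a minimal sketch, assuming word_tokenized_corpus and ft_model are still in scope) shows both what the tokenizer does with these mentions and which keys actually landed in the model's vocabulary:

import nltk

tokenizer = nltk.WordPunctTokenizer()

# WordPunctTokenizer splits at word/punctuation boundaries, so '@' and '.'
# become their own tokens, separate from the alphanumeric handle
print(tokenizer.tokenize("hello @.poisonjamak and @aminagabread"))
# -> ['hello', '@.', 'poisonjamak', 'and', '@', 'aminagabread']

# Inspect a couple of the actual training sentences
print(word_tokenized_corpus[:2])

# Check which keys the trained model really has
print('@aminagabread' in ft_model.wv.key_to_index)   # likely False
print('aminagabread' in ft_model.wv.key_to_index)    # likely True
print('@' in ft_model.wv.key_to_index)               # likely True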
There is, however, another contributing issue: your use of the max_n=0 parameter. That value turns off subword learning entirely, by qualifying no positive-length word-substrings (aka 'character n-grams') for vector-learning. This setting essentially turns FastText into plain Word2Vec.
If instead you were using FastText in a more usual way, it would've learned subword-vectors for some of the subwords in 'aminagabread' etc., and thus would've provided a synthetic "guess" word-vector for the full, unseen OOV token '@aminagabread'.
So in a way, the only reason you're getting an error that alerts you to the tokenization problem is this other deviation from usual FastText OOV behavior. If you really want FastText for its unique benefit of synthetic vectors for OOV words, you should return to a more typical max_n setting, as in the sketch below.
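Here's a minimal sketch of such a configuration: it mirrors your call but leaves gensim's subword defaults (min_n=3, max_n=6) in place, and adjusts min_count and workers per the tips further down.

from gensim.models import FastText

ft_model = FastText(word_tokenized_corpus,
                    vector_size=64,      # min_n=3 / max_n=6 defaults left intact
                    window=5,
                    min_count=5,         # see the usage tips below
                    sg=1,
                    epochs=50,
                    seed=42)

# Even a token never seen in training now gets a synthesized vector,
# assembled from the character n-grams it shares with in-vocabulary words
print(ft_model.wv['@aminagabread'])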
Separate usage tips:
min_count=1 is usually a bad idea with such word2vec-family algorithms: words with only one (or a few) usage examples don't have enough varied contexts to get good vectors themselves, but the failed attempt to learn them degrades training for surrounding words. Often, discarding such rare words entirely (as with the default min_count=5), as if they weren't there at all, improves downstream evaluations.

You're also unlikely to get the best training throughput from a workers=20 setting, even if you have 20 (or far more) CPU cores. The exact best setting in any situation will vary with a lot of things, including some of the model parameters, and only trial-and-error can narrow in on the best values. But it's more likely to be in the 6-12 range, even when more cores are available, than 16+.
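If you want to narrow that down empirically, a rough timing loop like this sketch (short runs on the same corpus with different workers values) is usually enough; splitting build_vocab from train keeps the comparison focused on training speed:

import time
from gensim.models import FastText

# Rough throughput check: train a few epochs at several worker counts and
# keep whichever finishes fastest. Illustrative sketch only.
for n_workers in (4, 6, 8, 12, 16, 20):
    model = FastText(vector_size=64, window=5, min_count=5, sg=1,
                     workers=n_workers, seed=42)
    model.build_vocab(word_tokenized_corpus)
    start = time.time()
    model.train(word_tokenized_corpus,
                total_examples=model.corpus_count,
                epochs=5)   # a short run is enough to compare speeds
    print(n_workers, 'workers:', round(time.time() - start, 1), 'seconds')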