python · nlp · gensim · fasttext

FastText: Can't see the representation of words that start with '@' or '@.'


I am working on an NLP project using FastText. I have some texts which contain words like @.poisonjamak, @aminagabread, @iamquak123 and I want to see their FastText representations. I should mention that the model has the following form:

from gensim.models import FastText

# FastText
ft_model = FastText(word_tokenized_corpus,
                    max_n=0,
                    vector_size=64,
                    window=5,
                    min_count=1,
                    sg=1,
                    workers=20,
                    epochs=50,
                    seed=42)

Using the model I try to see their representations; however, I get an error:

print(ft_model.wv['@.poisonjamak'])

KeyError: 'cannot calculate vector for OOV word without ngrams'

Of course, these words are in my texts. I get the above error for all three of these words; however, if I do the following, it works:

print(ft_model.wv['@.poisonjamak']) -----> print(ft_model.wv['poisonjamak'])
print(ft_model.wv['@aminagabread']) -----> print(ft_model.wv['aminagabread'])
print(ft_model.wv['@_iamquak123_']) -----> print(ft_model.wv['_iamquak123_'])

Question: So do you know why I have this problem?

Update: My dataset is called 'df' and the column with the texts is called 'text'. I am using the following code to prepare the texts for FastText. The FastText model is trained on word_tokenized_corpus.

import nltk

extra_list = df.text.tolist()
final_corpus = [sentence for sentence in extra_list if sentence.strip() != '']

word_punctuation_tokenizer = nltk.WordPunctTokenizer()
word_tokenized_corpus = [word_punctuation_tokenizer.tokenize(sent) for sent in final_corpus]

Solution

  • As the comments note, the main issue is likely with your tokenizer, which won't keep '@' characters inside your tokens. As a result, your FastText model isn't seeing the tokens you expect – but it probably does have a word-vector for the standalone 'word' '@'.

    Separately reviewing your actual word_tokenized_corpus, to see what it truly includes before the model gets to do its training, is a good way to confirm this (or catch this class of error in the future); the first sketch after these notes shows one way to do that.

    There is, however, another contributing issue: your use of the max_n=0 parameter. This turns off subword learning entirely, by qualifying no positive-length word-substrings (aka 'character n-grams') for vector-learning, and essentially turns FastText into plain Word2Vec.

    If instead you were using FastText in a more usual way, it would've learned subword-vectors for some of the subwords in 'aminagabread' etc., and thus would've provided a synthetic "guess" word-vector for the full, unseen '@aminagabread' OOV token.

    So, in a way, you're only seeing the error that alerts you to a problem in your tokenization because of this other deviation from usual FastText OOV behavior. If you really want FastText for its unique benefit of synthetic vectors for OOV words, you should return to a more typical max_n setting (the gensim defaults are min_n=3, max_n=6); the second sketch after the tips below shows what that looks like.

    Separate usage tips:

    • min_count=1 is usually a bad idea with word2vec-family algorithms: such rare words don't have enough varied usage examples to get good vectors themselves, and the failed attempt to learn them degrades training for the surrounding words. Often, discarding such words entirely (as with the default min_count=5), as if they weren't there at all, improves downstream evaluations.
    • Because of some inherent threading inefficiencies of the Python Global Interpreter Lock ("GIL"), and the Gensim approach of iterating over your corpus in one thread that parcels work out to worker threads, it is likely you'll get higher training throughput with fewer workers than your workers=20 setting, even if you have 20 (or far more) CPU cores. The exact best setting in any situation varies with many things, including some of the model parameters, and only trial-and-error can narrow it down, but it's more likely to be in the 6-12 range than 16+, even when more cores are available.
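To make the tokenizer point concrete, here is a minimal sketch, reusing your word_punctuation_tokenizer and ft_model from above. It shows how WordPunctTokenizer splits these handles, and which tokens actually made it into the model's vocabulary; the exact True/False results are an assumption about your data, so the comments only describe what is expected:

import nltk

word_punctuation_tokenizer = nltk.WordPunctTokenizer()

# WordPunctTokenizer separates runs of word characters from runs of punctuation,
# so the leading '@' (or '@.') becomes its own token:
for handle in ['@.poisonjamak', '@aminagabread', '@_iamquak123_']:
    print(handle, '->', word_punctuation_tokenizer.tokenize(handle))
# Expected output, roughly:
#   @.poisonjamak -> ['@.', 'poisonjamak']
#   @aminagabread -> ['@', 'aminagabread']
#   @_iamquak123_ -> ['@', '_iamquak123_']

# Hence the trained model should know 'poisonjamak' and '@', but not the
# full handle '@.poisonjamak':
print('@.poisonjamak' in ft_model.wv.key_to_index)  # expected: False
print('poisonjamak' in ft_model.wv.key_to_index)    # expected: True, if it appeared in your texts
print('@' in ft_model.wv.key_to_index)              # expected: True, if it appeared in your texts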
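And to illustrate the max_n point, here is a hedged sketch of a more typical FastText setup: the same word_tokenized_corpus as in your question, character n-grams left at the gensim defaults of min_n=3 and max_n=6, and min_count and workers adjusted per the tips above (the specific values are illustrative, not a tuned recommendation). With n-grams enabled, even a token the tokenizer never produced, such as '@aminagabread', gets a synthetic vector instead of a KeyError:

from gensim.models import FastText

# Same corpus as before, but with subword learning left on (min_n=3, max_n=6
# are the gensim defaults) and without min_count=1.
ft_model = FastText(word_tokenized_corpus,
                    vector_size=64,
                    window=5,
                    min_count=5,
                    min_n=3,
                    max_n=6,
                    sg=1,
                    workers=8,
                    epochs=50,
                    seed=42)

# The full handle is still not a known vocabulary word...
print('@aminagabread' in ft_model.wv.key_to_index)   # expected: False
# ...but FastText can now compose a "guess" vector from its character n-grams,
# so this no longer raises the 'OOV word without ngrams' KeyError:
print(ft_model.wv['@aminagabread'])

That said, if you actually want trained vectors for these handles rather than n-gram guesses, fixing the tokenization so the full handles survive as single tokens is still the better fix.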