Tags: python, nlp, gensim, fasttext

FastText representation for a short phrase, but not for a longer phrase containing the short one


I'm using Gensim to load the German .bin files from FastText in order to get vector representations for out-of-vocabulary words and phrases. So far it works fine and I get good results overall.
I am familiar with the KeyError: 'all ngrams for word <word> absent from model'. Clearly the model doesn't provide a vector representation for every possible n-gram combination.
But now I ran into a confusing (at least for me) issue.
I'll just give a quick example:
The model provides a representation for the phrase AuM Wert.
But when I want a representation for AuM Wert 50 Mio. EUR, I get the KeyError mentioned above. So the model obviously has a representation for the shorter phrase but not for the extended one.
It even returns a representation for AuM Wert 50 Mio.EUR (I just removed the space between 'Mio.' and 'EUR').
I mean, the statement in the error is simply not true, because the first example shows that the model does know some of the n-grams. Can someone explain this to me? What am I not understanding here? Is my understanding of n-grams wrong?

Here's the code:

from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format('cc.de.300.bin')
model.wv['AuM Wert']  # returns a vector
model.wv['AuM Wert 50 Mio.EUR']  # returns a vector
model.wv['AuM Wert 50 Mio. EUR']  # triggers the error

Thanks in advance,
Amos


Solution

  • I'm not certain what's causing the behavior you're seeing, though I have a theory below.

    But take note that the current gensim behavior (through 3.7.1) of sometimes raising KeyError: 'all ngrams for word <...> absent' for an OOV word does not match the behavior of Facebook's original FastText implementation, and is thus considered a bug.

    It should be fixed in the next release. You can read a change note about the new compatible behavior.

    So, in the near future with an up-to-date version of gensim, you will never see this KeyError. (Until then, a defensive-lookup sketch for older versions appears at the end of this answer.)

    In the meantime, factors that might explain your observed behavior include:

    • It's not typical to pass space-delimited phrases to FastText. Further, the usual tokenizations of training texts will only pass word-tokens without any internal whitespace, so for a typical model there's no chance such space-containing phrases will have full-word vectors (a per-token lookup sketch appears at the end of this answer). And none of their character n-grams that contain spaces will map to n-grams seen during training, either. To the extent you get a vector at all in gensim 3.7.1 and earlier, it will be because some of the n-grams not containing spaces were seen in training. (Post-3.7.1, you will always get a vector, though it may be composed from random collisions of the query word's novel n-grams with n-grams learned in training, or simply with randomly-initialized-but-never-trained vectors inside the model's n-gram hashtable.)
    • N-grams are learned with a synthetic start-of-word prefix and end-of-word suffix, specifically the characters < and >, and the default n-gram size range is 4 to 6 characters. So your string 'AuM Wert' will, among its n-grams, include '<AuM', 'Wert', 'Wert>', and 'ert>'. (All of its other n-grams include a space character, and thus couldn't possibly be in the set of n-grams learned during training on words without spaces.) But note that the longer phrase, on which you get the error, will not include the n-grams 'ert>' or 'Wert>', because the prior end-of-token has been replaced with a space. So the shorter phrase's n-grams are not a subset of the longer phrase's n-grams, and the longer phrase can error where the shorter one does not. (And your longer phrase without the space, which does not error, also includes a number of extra 4-6 character n-grams, not present in the erroring phrase, that may have appeared in the training data.) A small sketch enumerating these n-grams follows below.
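
    To make the n-gram point in the last bullet concrete, here is a small pure-Python sketch (not gensim's internal code) that enumerates character n-grams as described above, using the 4-6 character range and the synthetic '<'/'>' word-boundary markers. The function name char_ngrams is just for illustration:

    def char_ngrams(token, minn=4, maxn=6):
        # Wrap the token in the synthetic start-of-word '<' and end-of-word '>'
        # markers, then collect every character n-gram of length minn..maxn.
        wrapped = '<' + token + '>'
        return {wrapped[i:i + n]
                for n in range(minn, maxn + 1)
                for i in range(len(wrapped) - n + 1)}

    short = char_ngrams('AuM Wert')
    longer = char_ngrams('AuM Wert 50 Mio. EUR')

    # Only n-grams without spaces could have been learned from whitespace-tokenized training text.
    print(sorted(g for g in short if ' ' not in g))   # ['<AuM', 'Wert', 'Wert>', 'ert>']
    print('ert>' in longer)   # False: the end-of-word marker now comes after ' 50 Mio. EUR'
    print('Wert' in longer)   # True: still present as an internal n-gram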
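
    And as an interim way to handle such phrases with gensim 3.7.1 or earlier: one common approach (my own sketch, not something gensim or fastText prescribes) is to split the phrase into whitespace tokens, look each token up defensively, and average whatever vectors come back. The helper name phrase_vector and the zero-vector fallback are arbitrary choices here:

    import numpy as np
    from gensim.models.wrappers import FastText

    model = FastText.load_fasttext_format('cc.de.300.bin')

    def phrase_vector(wv, phrase):
        # Look up each whitespace token separately; in gensim <= 3.7.1 a token
        # whose space-free n-grams are all unknown raises KeyError, so skip it.
        vecs = []
        for token in phrase.split():
            try:
                vecs.append(wv[token])
            except KeyError:
                pass
        if not vecs:
            return np.zeros(wv.vector_size, dtype=np.float32)  # arbitrary fallback
        return np.mean(vecs, axis=0)

    vec = phrase_vector(model.wv, 'AuM Wert 50 Mio. EUR')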