I'm using Gensim to load the German .bin files from FastText in order to get vector representations for out-of-vocabulary words and phrases. So far it works fine and I achieve good results overall.
I am familiar with the KeyError: 'all ngrams for word <word> absent from model'. Clearly the model doesn't provide a vector representation for every possible n-gram combination.
But now I ran into a confusing (at least for me) issue.
I'll just give a quick example:
The model provides a representation for the phrase 'AuM Wert'. But when I want to get a representation for 'AuM Wert 50 Mio. EUR', I get the KeyError mentioned above. So the model obviously has a representation for the shorter phrase but not for the extended one.
It even returns a representation for 'AuM Wert 50 Mio.EUR' (I just removed the space between 'Mio.' and 'EUR').
I mean, the statement in the error is simply not true, because the first example shows that the model does know some of the n-grams. Can someone explain this to me? What am I missing here? Is my understanding of n-grams wrong?
Here's the code:
from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format('cc.de.300.bin')
model.wv['AuM Wert'] #returns a vector
model.wv['AuM Wert 50 Mio.EUR'] #returns a vector
model.wv['AuM Wert 50 Mio. EUR'] #triggers the error
Thanks in advance,
Amos
I'm not certain what's causing the behavior you're seeing, though I have a theory below.
But take note that the current gensim behavior (through 3.7.1) of sometimes raising KeyError: 'all ngrams for word <...> absent' for an OOV word does not conform to the behavior of Facebook's original FastText implementation, and is thus considered a bug. It should be fixed in the next release, and you can read a change note about the new, compatible behavior. So, in the near future, with an up-to-date version of gensim, you will never see this KeyError.
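If you need a stopgap on 3.7.1 or earlier, one option is simply to catch the exception at lookup time. This is a sketch of my own, not anything built into gensim (the helper name vector_or_none is mine):

def vector_or_none(model, phrase):
    # gensim <= 3.7.1 raises KeyError when none of the phrase's
    # n-grams are present in the loaded model
    try:
        return model.wv[phrase]
    except KeyError:
        return None  # caller decides how to handle truly unknown phrases

vec = vector_or_none(model, 'AuM Wert 50 Mio. EUR')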
In the meantime, factors that might explain your observed behavior include:
- FastText full-word vectors are only learned for the individual word-tokens seen during training. Further, usual tokenizations of training texts will only pass word-tokens without any internal whitespace. So for a typical model, there's no chance such space-containing phrases will have full-word vectors. And none of their character n-grams that contain spaces will map to n-grams seen during training, either. To the extent you get a vector at all in gensim 3.7.1 and earlier, it will be because some of the n-grams not containing spaces were seen in training. (Post-3.7.1, you will always get a vector, though it may be composed from random collisions of the query-word's novel n-grams with n-grams learned in training, or simply with randomly-initialized-but-never-trained vectors inside the model's n-gram hashtable.)
- FastText character n-grams are taken from each word after bracketing it with the special begin-of-word and end-of-word markers < and >, and the default n-gram size range is from 4 to 6 characters. So your string 'AuM Wert' will, among its n-grams, include '<AuM', 'Wert', 'Wert>', and 'ert>'. (All of its other n-grams include a space character, and thus couldn't possibly be in the set of n-grams learned during training on words without spaces.) But note that the longer phrase, on which you get the error, will not include the n-grams 'Wert>' or 'ert>', because 'Wert' is no longer followed by the end-of-token marker but by a space. So the shorter phrase's n-grams are not a subset of the longer phrase's n-grams, and the longer phrase can error where the shorter one does not. (And your longer phrase without a space, which does not error, also includes a number of extra 4-to-6-character n-grams spanning 'Mio.EUR' that may have been in the training data and that the erroring phrase does not have.) The sketch below makes this concrete.
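You can check that arithmetic yourself with a minimal sketch of FastText's bracket-then-slice n-gram extraction. This is a simplification: the real implementation also hashes each n-gram into a fixed-size bucket table, which I leave out here.

def char_ngrams(word, min_n=4, max_n=6):
    # FastText wraps each token in '<' and '>' before slicing n-grams
    bracketed = '<' + word + '>'
    return {bracketed[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(bracketed) - n + 1)}

short = char_ngrams('AuM Wert')
longer = char_ngrams('AuM Wert 50 Mio. EUR')

# Only space-free n-grams could have been learned from whitespace-tokenized text
print(sorted(g for g in short if ' ' not in g))
# ['<AuM', 'Wert', 'Wert>', 'ert>']
print(sorted(g for g in longer if ' ' not in g))
# ['<AuM', 'EUR>', 'Mio.', 'Wert']
print(sorted(g for g in short - longer if ' ' not in g))
# ['Wert>', 'ert>']

The last line shows exactly the two n-grams the shorter phrase has but the longer one lacks, which is why the two lookups can behave differently.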
, because the prior end-of-token has been replaced with a space. So, the shorter phrase's n-grams is not a proper subset of the larger phrase's n-grams – and the larger phrase could error where the shorter does not. (And your longer phrase without a space, that does not error, also includes a number of extra 4-6 character n-grams that may have been in training data, that the erroring phrase does not have.)