Discrepancy between documentation and implementation of spaCy vectors for German words?


According to the documentation:

spaCy's small models (all packages that end in sm) don't ship with word vectors, and only include context-sensitive tensors. [...] individual tokens won't have any vectors assigned.

But when I use the de_core_news_sm model, the tokens do have entries for x.vector, and x.has_vector is True.

These look like context vectors, but as far as I understand the documentation, only word vectors are accessible through the vector attribute, and sm models should not have any. Why does this work for a small model?


Solution

  • has_vector behaves differently than you expect.

    This is discussed in the comments on an issue raised on GitHub. The gist is that has_vector is True whenever a vector is available, even if that vector is a context vector rather than a word vector. Note that you can still use these vectors, e.g. to compute similarity.

    Quote from spaCy contributor Ines:

    We've been going back and forth on how the has_vector should behave in cases like this. There is a vector, so having it return False would be misleading. Similarly, if the model doesn't come with a pre-trained vocab, technically all lexemes are OOV.

    German word vectors have been announced for version 2.1.0.
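
The has_vector behaviour described in this answer can be pictured with a small toy model. This is only a sketch in plain Python and not spaCy's actual code: ToyVocab, ToyDoc, token_vector and token_has_vector are hypothetical names invented for illustration, and the fallback rule (use the context tensor row when the model ships no word-vector table) is an assumption based on the explanation quoted above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Vector = List[float]


@dataclass
class ToyVocab:
    """Hypothetical stand-in for a model's static word-vector table."""
    word_vectors: Dict[str, Vector] = field(default_factory=dict)


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a processed document."""
    words: List[str]
    vocab: ToyVocab
    # One context-sensitive row per token, or None if no tensor was produced.
    tensor: Optional[List[Vector]] = None


def token_vector(doc: ToyDoc, i: int) -> Optional[Vector]:
    """Return the vector a token would expose under the assumed fallback rule."""
    # Assumption: with an empty word-vector table (a "small" model),
    # fall back to the token's row of the context tensor.
    if not doc.vocab.word_vectors and doc.tensor is not None:
        return doc.tensor[i]
    # Otherwise look the word up in the static table (None if out of vocabulary).
    return doc.vocab.word_vectors.get(doc.words[i])


def token_has_vector(doc: ToyDoc, i: int) -> bool:
    """True whenever some vector is available, whatever its origin."""
    return token_vector(doc, i) is not None


# "Small model": no word vectors, but a context tensor exists.
small = ToyDoc(
    words=["Haus", "Maus"],
    vocab=ToyVocab(),
    tensor=[[0.1, 0.2], [0.3, 0.4]],
)

# No word vectors and no tensor: nothing to return.
bare = ToyDoc(words=["Haus"], vocab=ToyVocab())

# Model with a word-vector table: unknown words have no vector.
large = ToyDoc(
    words=["Haus", "Xyz"],
    vocab=ToyVocab(word_vectors={"Haus": [1.0, 0.0]}),
    tensor=[[0.1, 0.2], [0.3, 0.4]],
)

print(token_has_vector(small, 0))  # tensor row is used
print(token_has_vector(bare, 0))   # no vector of any kind
print(token_has_vector(large, 1))  # out-of-vocabulary word
```

In this sketch the first case mirrors what the question observes with de_core_news_sm: the flag reports that a vector exists, not that it is a pretrained word vector.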