According to the documentation:
spaCy's small models (all packages that end in sm) don't ship with word vectors, and only include context-sensitive tensors. [...] individual tokens won't have any vectors assigned.
But when I use the de_core_news_sm model, the tokens do have entries for x.vector, and x.has_vector=True.
It looks like these are context vectors, but as far as I understand the documentation, only word vectors are accessible through the vector attribute, and sm models should have none. Why does this work for a "small model"?
has_vector behaves differently than you expect.
This is discussed in the comments on an issue raised on GitHub. The gist is that, since a vector is available for each token, has_vector is True, even though those vectors are context vectors rather than word vectors. Note that you can still use them, e.g. to compute similarity.
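If you want to check this on your own install, here is a minimal sketch. It assumes spaCy and the de_core_news_sm package are installed; summarize_vectors is just a helper name I made up, and the example sentence is arbitrary. The exact values depend on your spaCy version, so I'm not listing expected output.

```python
def summarize_vectors(doc):
    """Return (text, has_vector, vector length) for every token in doc."""
    return [(tok.text, tok.has_vector, len(tok.vector)) for tok in doc]


def main():
    # Imported here so the helper above can be used without spaCy installed.
    import spacy

    # Assumption: the small German model has been downloaded beforehand.
    nlp = spacy.load("de_core_news_sm")
    doc = nlp("Das ist ein kleiner Test.")

    # Per-token view: does the token report a vector, and how long is it?
    for text, has_vec, dim in summarize_vectors(doc):
        print(text, has_vec, dim)

    # Shape of the word-vector table shipped with the model.
    # For a model without word vectors this table should be empty.
    print(nlp.vocab.vectors.shape)

    # Similarity is still computable from whatever vectors the tokens expose.
    print(doc[0].similarity(doc[3]))


if __name__ == "__main__":
    main()
```

Comparing the per-token output with the shape of nlp.vocab.vectors should show whether the vectors you get come from a real word-vector table or from the context tensors.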
Quote from spaCy contributor Ines:
We've been going back and forth on how the has_vector should behave in cases like this. There is a vector, so having it return False would be misleading. Similarly, if the model doesn't come with a pre-trained vocab, technically all lexemes are OOV.
Version 2.1.0 has been announced to include German word vectors.