spaCy's medium and large models ship with word vectors and can produce vectors for whole phrases as well. Consider the following example:
import spacy
nlp = spacy.load("en_core_web_md")
tokens = nlp("apple cat sky")
# Only the first three components of each vector
print(tokens.text, tokens.vector[:3], tokens.vector_norm)
for token in tokens:
    print(token.text, token.vector[:3], token.vector_norm)
Output:
apple cat sky [-0.06734333 0.03672066 -0.13952099] 4.845729844425328
apple [-0.36391 0.43771 -0.20447] 7.1346846
cat [-0.15067 -0.024468 -0.23368 ] 6.6808186
sky [ 0.31255 -0.30308 0.019587] 6.617719
It is clear that the vocabulary contains a vector for each word, but how is the vector for the entire phrase generated? As the output shows, it is not simply the sum of the token vectors.
By default, the vector of a Doc is the average of its token vectors; see https://spacy.io/usage/vectors-similarity:
Models that come with built-in word vectors make them available as the Token.vector attribute. Doc.vector and Span.vector will default to an average of their token vectors.
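This is easy to verify with plain NumPy using the first three vector components printed in the output above: averaging the three token vectors element-wise reproduces the Doc vector. (A sketch using the printed values, so no spaCy model download is needed.)

```python
import numpy as np

# First three components of each token vector, copied from the output above
token_vectors = np.array([
    [-0.36391,  0.43771,  -0.20447],   # apple
    [-0.15067, -0.024468, -0.23368],   # cat
    [ 0.31255, -0.30308,   0.019587],  # sky
])

# Doc.vector defaults to the element-wise mean of the token vectors
doc_vector = token_vectors.mean(axis=0)
print(doc_vector)  # ≈ [-0.06734333  0.03672066 -0.13952099], matching the Doc output
```

Note that the same does not hold for `vector_norm`: the norm of the averaged vector is not the average of the token norms, which is why the Doc's norm (≈4.85) is smaller than any of the individual token norms.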