nlp spacy word2vec

How does spaCy generate vectors for phrases?


spaCy's medium and large models (for example en_core_web_md) include word vectors and can produce vectors for both single words and whole phrases. Consider the following example:

import spacy

# The medium English model ships with word vectors
nlp = spacy.load("en_core_web_md")
tokens = nlp("apple cat sky")

# Vector of the whole phrase (only the first three components are shown)
print(tokens.text, tokens.vector[:3], tokens.vector_norm)

# Vector of each individual token
for token in tokens:
    print(token.text, token.vector[:3], token.vector_norm)

Output:

apple cat sky [-0.06734333  0.03672066 -0.13952099] 4.845729844425328
apple [-0.36391  0.43771 -0.20447] 7.1346846
cat [-0.15067  -0.024468 -0.23368 ] 6.6808186
sky [ 0.31255  -0.30308   0.019587] 6.617719

Clearly the vocabulary contains a vector for each word, but how is the vector for the entire phrase generated? As the output shows, it is not simply the sum of the token vectors.


Solution

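  • By default, the vector of a Doc is the average of the vectors of its tokens. For the example above, (-0.36391 - 0.15067 + 0.31255) / 3 ≈ -0.06734, which matches the first component of the phrase vector. See https://spacy.io/usage/vectors-similarity: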

    Models that come with built-in word vectors make them available as the Token.vector attribute. Doc.vector and Span.vector will default to an average of their token vectors.
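
The averaging can be checked by hand against the numbers printed in the question. The following is a minimal sketch in plain Python that uses only the first three components copied from that output; the names `token_vectors`, `mean_vector` and `doc_vector_head` are made up for this illustration and are not part of spaCy.

```python
# First three components of each token vector, copied from the output
# in the question (not recomputed from the model).
token_vectors = {
    "apple": [-0.36391, 0.43771, -0.20447],
    "cat": [-0.15067, -0.024468, -0.23368],
    "sky": [0.31255, -0.30308, 0.019587],
}


def mean_vector(vectors):
    """Return the component-wise average of equal-length vectors."""
    vectors = list(vectors)
    return [sum(components) / len(vectors) for components in zip(*vectors)]


# Average of the three token vectors, component by component.
doc_vector_head = mean_vector(token_vectors.values())
print([round(x, 6) for x in doc_vector_head])
```

The printed values should agree with the first three components of the phrase vector in the question (-0.06734333, 0.03672066, -0.13952099) up to rounding. With the model loaded, the same comparison on the full vectors could presumably be done with something like `numpy.allclose(tokens.vector, numpy.mean([t.vector for t in tokens], axis=0))`, assuming the default Doc.vector behaviour described in the quote above has not been overridden.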