Tags: python, nlp, spacy, bert-language-model

Use BERT under spaCy to get sentence embeddings


I am trying to use BERT to get sentence embeddings. Here is how I am doing it:

import spacy
nlp = spacy.load("en_core_web_trf")
nlp("The quick brown fox jumps over the lazy dog").vector 

This outputs an empty vector!!

array([], dtype=float32)

Am I missing something?


Solution

  • Transformer pipelines work a bit differently from the other spaCy models: en_core_web_trf does not fill in .vector, which is why it comes back empty. The transformer output is stored in the doc._.trf_data extension instead, and you can use doc._.trf_data.tensors[1] as the sentence embedding (the model's pooled output for the whole text).
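
    For example (a minimal sketch; the shape in the comment assumes the default RoBERTa-base transformer that en_core_web_trf ships with):

    import spacy

    nlp = spacy.load("en_core_web_trf")
    doc = nlp("The quick brown fox jumps over the lazy dog")

    # The pooled transformer output, usable as a sentence embedding
    sentence_embedding = doc._.trf_data.tensors[1]
    print(sentence_embedding.shape)  # (1, 768)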

    The vectors for the individual BPE (Byte-Pair Encoding) token-pieces are in doc._.trf_data.tensors[0]. Note that I use the term token-pieces rather than tokens, to avoid confusion between spaCy tokens and the tokens produced by the BPE tokenizer.

    E.g., in our case the spaCy tokens are:

    for i, spacy_tok in enumerate(doc, start=1):
      print(f"spacy-token {i}: {spacy_tok.text}")
    
    spacy-token 1: The
    spacy-token 2: quick
    spacy-token 3: brown
    spacy-token 4: fox
    spacy-token 5: jumps
    spacy-token 6: over
    spacy-token 7: the
    spacy-token 8: lazy
    spacy-token 9: dog
    

    and the token-pieces are:

    for i, tok_piece in enumerate(doc._.trf_data.tokens['input_texts'][0], start=1):
      print(f"token-piece {i}: {tok_piece}")
    
    token-piece 1: <s>
    token-piece 2: The
    token-piece 3: Ġquick
    token-piece 4: Ġbrown
    token-piece 5: Ġfox
    token-piece 6: Ġjumps
    token-piece 7: Ġover
    token-piece 8: Ġthe
    token-piece 9: Ġlazy
    token-piece 10: Ġdog
    token-piece 11: </s>
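
    Each of these token-pieces gets its own row in doc._.trf_data.tensors[0]. A quick shape check ties the two together (again a sketch, assuming the same RoBERTa-base model):

    # One 768-dim vector per token-piece, including <s> and </s>
    piece_vectors = doc._.trf_data.tensors[0]
    print(piece_vectors.shape)  # (1, 11, 768): 1 text, 11 token-pieces, 768 dims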