I'm using spaCy 3.0 to vectorize text with a transformer model. For data privacy reasons, the vectorization has to happen on a different machine than the one that trains the model. To reduce the amount of data I generate and have to transfer between machines, I extract the token ids of the text like this:
import spacy
nlp = spacy.load("de_dep_news_trf")
doc = nlp("Eine Bank steht im Park.")
print(doc._.trf_data.tokens["input_ids"])
which returns
tensor([[ 3, 917, 2565, 1302, 106, 3087, 26914, 4]])
Having the ids now, is it possible to extract the correct tensors from the language model (de_dep_news_trf) using spaCy?
Unfortunately, this is not possible. The problem is that transformer models generate their embeddings for individual tokens based on the surrounding context. That means if the same token_id appears in two different sentences, it will likely have a (significantly) different embedding in each. Your example sentence illustrates this: in "Eine Bank steht im Park" the word "Bank" means a bench, but in a financial context the same wordpiece would get a very different vector. The only way is to return the tensor associated with each of the tokens; you cannot reconstruct the embeddings from the input_ids alone.
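In other words, what you would have to transfer is the per-wordpiece hidden states rather than the ids. A minimal sketch of that workflow is below. Note the assumptions: it uses `doc._.trf_data.tensors[0]` for the last hidden states, which is the spacy-transformers 1.0.x layout (newer versions expose `doc._.trf_data.model_output.last_hidden_state` instead), and since the runnable part cannot download `de_dep_news_trf`, it substitutes a random array of the same shape as your 8-wordpiece example with a 768-dimensional base-size model:

```python
import numpy as np

# On the machine that runs the transformer (sketch; requires the
# de_dep_news_trf model, so the spaCy lines are shown as comments):
#
#   import spacy
#   nlp = spacy.load("de_dep_news_trf")
#   doc = nlp("Eine Bank steht im Park.")
#   # In spacy-transformers 1.0.x, tensors[0] holds the last hidden
#   # states, shape (1, n_wordpieces, hidden_size)
#   hidden = doc._.trf_data.tensors[0]

# Stand-in array with the shape the example sentence would produce
# (8 wordpieces, 768-dim hidden states for a base-size model):
hidden = np.random.rand(1, 8, 768).astype("float32")

# Transfer the embeddings, not the ids: save on machine A ...
np.save("embeddings.npy", hidden)

# ... and load on machine B, which never sees the raw text
loaded = np.load("embeddings.npy")
```

This is more data per sentence than the ids, but it is the only representation the training machine can actually use, since the context-dependent vectors exist only after the forward pass.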