
using spacy to extract tensor by token id


I'm using spaCy 3.0 to vectorize a text with a transformer model. For data privacy reasons, the vectorization has to happen on a different machine than the one that trains the model. To reduce the amount of data I generate and have to transfer between machines, I extract the token ids of the text like this:

import spacy
nlp = spacy.load("de_dep_news_trf")
doc = nlp("Eine Bank steht im Park.")
print(doc._.trf_data.tokens["input_ids"])

which returns

tensor([[    3,   917,  2565,  1302,   106,  3087, 26914,     4]])

Having the ids now, is it possible to extract the correct tensors from the language model (de_dep_news_trf) using spacy?


Solution

  • Unfortunately, this is not possible. Transformer models generate their embeddings for individual tokens based on the surrounding context: if the same token_id appears in two different sentences, it will likely have a (significantly) different embedding in each. The only option is to return the tensor associated with each token; you cannot reconstruct the embeddings from the input_ids alone.
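Since only the tensors themselves carry the contextual information, a practical workaround is to serialize the hidden states from doc._.trf_data.tensors on the vectorizing machine and transfer those instead of the input_ids. A minimal sketch, assuming the spaCy 3 trf pipeline from the question (the spaCy call is shown as a comment, and a dummy array stands in for its output so the snippet runs without the model; the shape [1, n_wordpieces, 768] is an assumption based on typical BERT-sized models):

```python
import numpy as np

# On the vectorizing machine, with the model loaded, you would extract
# the per-token hidden states like this (assumption: the first element
# of .tensors holds the last hidden state for this pipeline):
#
#   nlp = spacy.load("de_dep_news_trf")
#   doc = nlp("Eine Bank steht im Park.")
#   hidden = doc._.trf_data.tensors[0]
#
# Dummy stand-in so this sketch is self-contained:
hidden = np.random.rand(1, 8, 768).astype(np.float32)

# Serialize for transfer to the training machine; .npy is lossless.
np.save("doc_tensors.npy", hidden)

# On the training machine, load the tensors back:
restored = np.load("doc_tensors.npy")
assert np.array_equal(restored, hidden)
```

The transfer volume is of course much larger than the input_ids (hundreds of floats per wordpiece), which is the trade-off the answer above implies: the context-dependent embeddings must travel with the data.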