I am working on a word-level classification task on multilingual data using XLM-R. I know that XLM-R uses SentencePiece
as its tokenizer, which sometimes splits words into subwords.
For example, when the sentence "deception master" is tokenized, the word "deception" is split into two subwords.
How can I get the embedding of "deception"? I can take the mean of the subwords to get the embedding of the word, as done here. But I have to implement my code in TensorFlow, and the TensorFlow computational graph doesn't support NumPy.
I could store the final hidden embeddings after taking the mean of the subwords into a NumPy array and give this array as input to the model, but I want to fine-tune the transformer.
How can I get the word embeddings from the subword embeddings given by the transformer?
Joining subword embeddings into words for word labeling is not how this problem is usually approached. The usual approach is the opposite: keep the subwords as they are, but adjust the labels to respect the tokenization of the pre-trained model.
One of the reasons is that the data is typically processed in batches. When merging subwords into words, every sentence in the batch would end up having a different length, which would require processing each sentence independently and padding the batch again, and this would be slow. Also, if you do not average the neighboring embeddings, you get more fine-grained information from the loss function, which tells you explicitly which subword is responsible for an error.
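To make the batching cost concrete, here is a minimal sketch of what merging subwords would involve in TensorFlow. It assumes `tf.math.segment_mean` for the averaging; the tensors, the word ids, and the helper name `pool_subwords` are made up for illustration and stand in for real transformer outputs with special tokens already removed.

```python
import tensorflow as tf

def pool_subwords(hidden, word_ids):
    """Average subword vectors that share a word id (ids must be sorted)."""
    return tf.math.segment_mean(hidden, word_ids)

# Toy stand-ins for transformer outputs with hidden size 2.
sent_a = tf.constant([[1.0, 3.0], [3.0, 5.0], [10.0, 20.0]])  # "de", "ception", "master"
ids_a = tf.constant([0, 0, 1])  # the first two subwords belong to word 0
sent_b = tf.constant([[7.0, 7.0]])  # hypothetical one-word sentence
ids_b = tf.constant([0])

# Each sentence is pooled on its own, giving a different number of word vectors.
pooled = [pool_subwords(sent_a, ids_a), pool_subwords(sent_b, ids_b)]

# The batch then has to be padded again to a common number of words.
max_words = max(p.shape[0] for p in pooled)
batch = tf.stack(
    [tf.pad(p, [[0, max_words - p.shape[0]], [0, 0]]) for p in pooled]
)
```

The averaging stays inside TensorFlow, so gradients can still flow back to the transformer, but the per-sentence loop and the second padding step are the overhead described above.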
When tokenizing using SentencePiece, you can get the indices in the original string:
from transformers import XLMRobertaTokenizerFast
tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
tokenizer("deception master", return_offsets_mapping=True)
This returns the following dictionary:
{'input_ids': [0, 8, 63928, 31347, 2],
'attention_mask': [1, 1, 1, 1, 1],
'offset_mapping': [(0, 0), (0, 2), (2, 9), (10, 16), (0, 0)]}
With the offsets, you can find out whether a subword corresponds to a word that you want to label. There are various strategies for encoding the labels. The easiest one is to copy the word's label to every one of its subwords. A more elaborate way is to use the schemes common in named entity recognition, such as IOB tagging, which explicitly marks the beginning of each labeled segment.
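Both strategies can be sketched in plain Python on top of the offsets shown above. This is only an illustration under a few assumptions: words are separated by whitespace, each word is treated as its own labeled segment in the IOB variant, special tokens (which have empty offsets) get `None`, and the label names and the helpers `word_spans` and `align_labels` are hypothetical.

```python
def word_spans(text):
    """Return (start, end) character spans of whitespace-separated words."""
    spans, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        spans.append((start, end))
        pos = end
    return spans

def align_labels(text, word_labels, offsets, scheme="copy"):
    """Map one label per word onto one label per subword."""
    spans = word_spans(text)
    aligned = []
    for start, end in offsets:
        if start == end:  # special tokens such as <s> and </s>
            aligned.append(None)
            continue
        # Find the word whose character span contains this subword.
        idx = next(
            i for i, (ws, we) in enumerate(spans) if ws <= start and end <= we
        )
        label = word_labels[idx]
        if scheme == "iob" and label != "O":
            # B- on the first subword of the word, I- on the rest.
            label = ("B-" if start == spans[idx][0] else "I-") + label
        aligned.append(label)
    return aligned

text = "deception master"
offsets = [(0, 0), (0, 2), (2, 9), (10, 16), (0, 0)]
word_labels = ["ABSTRACT", "PERSON"]  # made-up labels, one per word

copied = align_labels(text, word_labels, offsets)
iob = align_labels(text, word_labels, offsets, scheme="iob")
```

During training, the `None` positions would typically be mapped to a label id that the loss ignores, so that special tokens do not contribute to the gradient.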