In the HuggingFace tokenizer, the max_length argument specifies the maximum length of the tokenized text. I believe it truncates the sequence to max_length - 2 content tokens (if truncation=True), leaving room for the two special tokens, by cutting the excess tokens from the right. For the purposes of utterance classification, I need to cut the excess tokens from the left, i.e. from the start of the sequence, so that the last tokens are preserved. How can I do that?
from transformers import AutoTokenizer
train_texts = ['text 1', ...]
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
# by default, excess tokens are cut from the right
encodings = tokenizer(train_texts, max_length=128, truncation=True)
Tokenizers have a truncation_side parameter that controls exactly this: set it to 'left' and excess tokens are dropped from the start of the sequence instead of the end. See the docs.
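
For example (a minimal sketch, assuming a reasonably recent transformers release that exposes truncation_side), you can either pass it when loading the tokenizer or set it as an attribute on an already-loaded one, then tokenize exactly as in the question:

from transformers import AutoTokenizer

train_texts = ['text 1', ...]  # same placeholder list as in the question

# Option 1: request left-side truncation when loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base', truncation_side='left')

# Option 2: flip the attribute on an already-loaded tokenizer
tokenizer.truncation_side = 'left'

# With either option, excess tokens are now dropped from the start of each sequence
encodings = tokenizer(train_texts, max_length=128, truncation=True)

A quick sanity check is to decode one of the truncated encodings and confirm it ends with the final words of the original utterance rather than the opening ones.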