In the HuggingFace tokenizer, the max_length argument specifies the maximum length of the tokenized text. I believe it truncates the sequence to max_length - 2 content tokens (if truncation=True), leaving room for the two special tokens, by cutting the excess tokens from the right. For the purposes of utterance classification, I need to cut the excess tokens from the left, i.e. from the start of the sequence, so that the last tokens are preserved. How can I do that?
from transformers import AutoTokenizer
train_texts = ['text 1', ...]
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
# by default, excess tokens are cut from the right
encodings = tokenizer(train_texts, max_length=128, truncation=True)
Tokenizers have a truncation_side parameter that controls exactly this: set it to 'left' and excess tokens are dropped from the start of the sequence instead of the end. See the docs.
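
For example (a minimal sketch, assuming a reasonably recent transformers release that exposes truncation_side), you can either pass it when loading the tokenizer or set it as an attribute on an already-loaded one, then tokenize exactly as in the question:

from transformers import AutoTokenizer

train_texts = ['text 1', ...]  # same placeholder list as in the question

# Option 1: request left-side truncation when loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base', truncation_side='left')

# Option 2: flip the attribute on an already-loaded tokenizer
tokenizer.truncation_side = 'left'

# With either option, excess tokens are now dropped from the start of each sequence
encodings = tokenizer(train_texts, max_length=128, truncation=True)

A quick sanity check is to decode one of the truncated encodings and confirm it ends with the final words of the original utterance rather than the opening ones.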