Tags: python, pytorch, huggingface-transformers, bert-language-model, huggingface-tokenizers

How to apply max_length to truncate the token sequence from the left in a HuggingFace tokenizer?


In a HuggingFace tokenizer, the max_length argument caps the length of the tokenized text. I believe it truncates the sequence to max_length-2 (if truncation=True), reserving two slots for the special tokens, and cuts the excess tokens from the right. For utterance classification I need to cut the excess tokens from the left, i.e. from the start of the sequence, so that the last tokens are preserved. How can I do that?

from transformers import AutoTokenizer

train_texts = ['text 1', ...]
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
encodings = tokenizer(train_texts, max_length=128, truncation=True)

Solution

  • Tokenizers have a truncation_side parameter that controls exactly this. You can pass it to from_pretrained or set it on an already-loaded tokenizer; see the docs and the sketch below.
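
A minimal sketch of both ways to enable left-side truncation (passing truncation_side to from_pretrained assumes a reasonably recent transformers release; on older versions, set the attribute instead):

from transformers import AutoTokenizer

# Option 1: ask for left-side truncation when loading the tokenizer.
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base', truncation_side='left')

# Option 2: flip the attribute on an already-loaded tokenizer.
# tokenizer.truncation_side = 'left'

# A deliberately over-long input: ~300 whitespace-separated tokens.
long_text = ' '.join(str(i) for i in range(300))
encodings = tokenizer([long_text], max_length=128, truncation=True)

# The surviving tokens come from the end of the input; the start was cut away.
print(tokenizer.decode(encodings['input_ids'][0]))

With truncation_side='left', the decoded output contains the final tokens of the input (e.g. "... 298 299") rather than the first ones, which is exactly the behavior needed to preserve the end of each utterance.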