Tags: huggingface-transformers, huggingface-tokenizers

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation


I'm learning NLP by following this sequence classification tutorial from HuggingFace: https://huggingface.co/transformers/custom_datasets.html#sequence-classification-with-imdb-reviews The original code runs without problems, but when I try to load a different tokenizer, such as the one from google/bert_uncased_L-4_H-256_A-4, the following warning appears:

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

from transformers import AutoTokenizer
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            labels.append(0 if label_dir == "neg" else 1)  # compare strings with ==, not "is"

    return texts[:50], labels[:50]

if __name__ == '__main__':
    test_texts, test_labels = read_imdb_split('aclImdb/test')
    tokenizer = AutoTokenizer.from_pretrained('google/bert_uncased_L-4_H-256_A-4')
    test_encodings = tokenizer(test_texts, truncation=True, padding=True)
    for input_id in test_encodings["input_ids"]:
        print(len(input_id))

The output shows that every input_id has len = 1288; it seems they have all been padded to 1288. But how can I specify a truncation target length, such as 512?


Solution

  • Specify model_max_length when loading the tokenizer.

    tokenizer = AutoTokenizer.from_pretrained('google/bert_uncased_L-4_H-256_A-4', model_max_length=512)
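  • Alternatively, you can pass max_length directly in the tokenizer call. Below is a minimal sketch (the sample texts are made up for illustration) showing both options and verifying that no sequence exceeds 512 tokens:

    from transformers import AutoTokenizer

    # Option 1: set model_max_length at load time so truncation=True
    # has a length to truncate to.
    tokenizer = AutoTokenizer.from_pretrained(
        'google/bert_uncased_L-4_H-256_A-4', model_max_length=512
    )

    # Hypothetical sample inputs, only for illustration.
    texts = ["a short positive review", "a very long review " * 400]

    encodings = tokenizer(texts, truncation=True, padding=True)
    print([len(ids) for ids in encodings["input_ids"]])  # each length <= 512

    # Option 2: keep the tokenizer as loaded and pass max_length per call.
    encodings = tokenizer(texts, truncation=True, padding=True, max_length=512)
    print([len(ids) for ids in encodings["input_ids"]])  # each length <= 512

    With padding=True the batch is still padded to its longest sequence, but once truncation caps that sequence at 512, every input_id ends up at most 512 tokens long.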