Tags: python, nlp, huggingface-transformers, huggingface-tokenizers

Huggingface transformers padding vs pad_to_max_length


I'm running code with pad_to_max_length=True and everything works fine. I only get the following warning:

FutureWarning: The pad_to_max_length argument is deprecated and will be removed in a future version, use padding=True or padding='longest' to pad to the longest sequence in the batch, or use padding='max_length' to pad to a max length. In this case, you can give a specific length with max_length (e.g. max_length=45) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).

But when I change pad_to_max_length=True to padding='max_length', I get this error:

RuntimeError: stack expects each tensor to be equal size, but got [60] at entry 0 and [64] at entry 6

How can I change the code to the new version? Or did I misunderstand something in the warning message?

This is my encoder:

encoding = self.tokenizer.encode_plus(
    poem,
    add_special_tokens=True,
    max_length=60,
    return_token_type_ids=False,
    pad_to_max_length=True,
    return_attention_mask=True,
    return_tensors='pt',
)

Solution

  • It seems that the documentation is not complete enough!

    You should also add truncation=True to mimic the old pad_to_max_length=True behaviour: padding='max_length' only pads shorter sequences up to max_length, it does not shorten longer ones, so any input longer than max_length keeps its full size and the batch can no longer be stacked.

    like this:

    encoding = self.tokenizer.encode_plus(
        poem,
        add_special_tokens=True,
        max_length=self.max_len,
        return_token_type_ids=False,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt',
    )
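
    To see why this fixes the RuntimeError, here is a minimal, self-contained sketch (the model name bert-base-uncased and the example strings are just illustrative assumptions, not part of the original code): with truncation=True every encoding comes out at exactly max_length tokens, so the tensors can be stacked into one batch.

    import torch
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    poems = [
        "A short poem.",
        "A much longer poem " * 40,  # deliberately longer than 60 tokens
    ]

    encodings = [
        tokenizer.encode_plus(
            poem,
            add_special_tokens=True,
            max_length=60,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,  # without this, the long poem keeps its full length
            return_attention_mask=True,
            return_tensors='pt',
        )
        for poem in poems
    ]

    # Every input_ids tensor is now exactly 60 tokens long, so stacking works.
    input_ids = torch.stack([e['input_ids'].squeeze(0) for e in encodings])
    print(input_ids.shape)  # torch.Size([2, 60])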