Tags: python, nlp, huggingface-transformers, sentiment-analysis

Will using the arguments max_length, truncation, and padding in the transformers pipeline affect the output?


Hello, I was checking the sentiment of a text using a pretrained transformers model, but doing so gave me an error:

RuntimeError: The size of tensor a (1954) must match the size of tensor b (512) at non-singleton dimension 1

I went through a few posts which suggested that setting max_length to 512 would sort the error. It did resolve the error, but I want to know how it affects the quality of the output. Does it truncate my text? For example, if the length of my text is 1195, will it only process up to 512, something like text[:512]?
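For reference, a minimal sketch of the kind of call I mean (the model name is just the pipeline's default sentiment checkpoint, and the text is a stand-in, not my actual input):

```python
from transformers import pipeline

# The model name is the pipeline's usual default sentiment checkpoint;
# shown explicitly here only for completeness.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

long_text = "some review text " * 500  # well over 512 tokens

# Without these kwargs the call fails with the tensor-size RuntimeError above;
# truncation/max_length are forwarded to the tokenizer, which cuts the input
# down to the model's 512-token limit.
result = classifier(long_text, truncation=True, max_length=512)
print(result)  # e.g. [{'label': 'NEGATIVE', 'score': 0.98}]
```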


Solution

  • Yes. It means the sentiment will be based on the first 512 tokens, and any tokens after that will not influence the result.

    Note that this is tokens, not characters. If text was your raw string, and if we assume that on average each token is 2.5 characters, then truncating at 512 tokens would be the same as text[:1280].

    (The number of characters per token can vary a lot based on the model, the tokenizer, the language, and the domain, but mainly on how unusual the string is compared to the text the tokenizer was trained on. You can check where the cut-off lands for your own text, as in the first sketch below.)

    By the way, according to https://huggingface.co/docs/transformers/pad_truncation, if you don't specify truncation then no truncation is applied; and if you do, but don't specify max_length, then it defaults to the maximum length supported by the model. So setting max_length without also enabling truncation shouldn't have fixed it on its own. (I've not tested anything or read the code; that is just my understanding of the documentation.)
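If you want to see where the 512-token cut-off lands in your own string, you can run the tokenizer directly. A minimal sketch, assuming the pipeline's default checkpoint (any tokenizer illustrates the same point):

```python
from transformers import AutoTokenizer

# Assumed checkpoint: the sentiment pipeline's default model.
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

text = "a long review " * 400  # stand-in for your real input

# Truncate at the token level, exactly as the pipeline's preprocessing would.
encoded = tokenizer(text, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # 512, regardless of character count

# Decode to see which slice of the raw string survived; the number of
# characters kept depends on the tokenizer, not a fixed text[:1280] rule.
kept_text = tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)
print(len(text), len(kept_text))
```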
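And you can check the truncation/max_length interplay described on that doc page yourself. A second sketch with the same assumed checkpoint; note that depending on your transformers version, passing max_length alone may log a warning and fall back to truncating anyway, which would explain why it fixed the error:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
text = "a long review " * 400  # comfortably longer than 512 tokens

# Compare the three cases; a length above 512 means nothing was truncated.
for kwargs in (
    {},                    # neither argument
    {"max_length": 512},   # max_length alone
    {"truncation": True},  # truncation alone: defaults to the model max (512)
):
    print(kwargs, len(tokenizer(text, **kwargs)["input_ids"]))
```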