I am using Transformers' RobBERT (the Dutch version of RoBERTa) for sequence classification, trained for sentiment analysis on the Dutch Book Reviews dataset.
I wanted to test how well it works on a similar dataset (also for sentiment analysis), so I annotated a set of text fragments and checked the model's accuracy. When I looked at which sentences were misclassified, I noticed that the output for a single sentence depends heavily on the amount of padding applied during tokenization. See the code below.
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch.nn.functional as F
import torch
model = RobertaForSequenceClassification.from_pretrained("pdelobelle/robBERT-dutch-books", num_labels=2)
tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robBERT-dutch-books", do_lower_case=True)
sent = 'De samenwerking gaat de laatste tijd beter'
max_seq_len = 64
test_token = tokenizer(sent,
                       max_length=max_seq_len,
                       padding='max_length',
                       truncation=True,
                       return_tensors='pt')
out = model(test_token['input_ids'], test_token['attention_mask'])
probs = F.softmax(out[0], dim=1).detach().numpy()
For the given sample text, which translates to English as "The collaboration has been improving lately", there is a huge difference in the classification output depending on max_seq_len. For max_seq_len = 64, the output for probs is:
[[0.99149346 0.00850648]]
whilst for max_seq_len = 9, which should be the actual length including the special tokens, it is:
[[0.00494814 0.9950519 ]]
Can anyone explain why this huge difference in classification happens? I would have thought that the attention mask ensures the output does not change when padding up to the max sequence length.
This happens because your comparison isn't correct. The sentence De samenwerking gaat de laatste tijd beter actually consists of 16 tokens (+2 for the special tokens), not 9. You only counted the words, which are not necessarily the same as the tokens.
print(tokenizer.tokenize(sent))
print(len(tokenizer.tokenize(sent)))
Output:
['De', 'Ġsam', 'en', 'wer', 'king', 'Ġga', 'at', 'Ġde', 'Ġla', 'at', 'ste', 'Ġt', 'ij', 'd', 'Ġbe', 'ter']
16
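To see the length the model actually receives (i.e. including the special tokens), you can also count the input ids produced without truncation. A small sketch, reusing tokenizer and sent from above:
print(len(tokenizer(sent)['input_ids']))  # 18 = 16 sub-word tokens + 2 special tokens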
When you set the sequence length to 9, you are truncating the sentence to:
tokenizer.decode(tokenizer(sent,
                           max_length=9,
                           padding='max_length',
                           truncation=True,
                           return_tensors='pt',
                           add_special_tokens=False
                           )['input_ids'][0])
Output:
'De samenwerking gaat de la'
And as a final proof, the output when you set max_length to 52 is also [[0.99149346 0.00850648]].
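If you want to check this yourself, here is a minimal sketch (reusing model, tokenizer, sent, torch and F from the question; the list of lengths is just an illustrative choice) that runs the same forward pass for a few max_length values at or above the real input length of 18. Since nothing gets truncated there, the attention mask only covers padding and the probabilities come out identical:
for length in [18, 32, 52, 64]:
    tokens = tokenizer(sent, max_length=length, padding='max_length',
                       truncation=True, return_tensors='pt')
    with torch.no_grad():
        logits = model(tokens['input_ids'], tokens['attention_mask'])[0]
    print(length, F.softmax(logits, dim=1).numpy())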