I am using Transformers' RobBERT (the Dutch version of RoBERTa) for sequence classification, trained for sentiment analysis on the Dutch Book Reviews dataset.
I wanted to test how well it works on a similar dataset (also for sentiment analysis), so I annotated a set of text fragments and checked the model's accuracy. When I looked at which sentences were misclassified, I noticed that the output for a single sentence depends heavily on the amount of padding applied during tokenization. See the code below.
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch.nn.functional as F
import torch
model = RobertaForSequenceClassification.from_pretrained("pdelobelle/robBERT-dutch-books", num_labels=2)
tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robBERT-dutch-books", do_lower_case=True)
sent = 'De samenwerking gaat de laatste tijd beter'
max_seq_len = 64
test_token = tokenizer(sent,
                       max_length=max_seq_len,
                       padding='max_length',
                       truncation=True,
                       return_tensors='pt')
out = model(test_token['input_ids'], test_token['attention_mask'])
probs = F.softmax(out[0], dim=1).detach().numpy()
For the given sample text, which translates to English as "The collaboration has been improving lately", there is a huge difference in the classification output depending on max_seq_len. For max_seq_len = 64, the output for probs is:
[[0.99149346 0.00850648]]
whilst for max_seq_len = 9, which should be the actual length including the special tokens, it is:
[[0.00494814 0.9950519 ]]
Can anyone explain why this huge difference in classification happens? I would have thought that the attention mask ensures the output does not change when padding up to the max sequence length.
This happens because your comparison isn't correct. The sentence De samenwerking gaat de laatste tijd beter actually consists of 16 tokens (+2 for the special tokens), not 9. You only counted the words, which are not necessarily the same as the tokens.
print(tokenizer.tokenize(sent))
print(len(tokenizer.tokenize(sent)))
Output:
['De', 'Ġsam', 'en', 'wer', 'king', 'Ġga', 'at', 'Ġde', 'Ġla', 'at', 'ste', 'Ġt', 'ij', 'd', 'Ġbe', 'ter']
16
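To see the length the model actually receives (i.e. including the special tokens), you can also count the input ids produced without truncation. A small sketch, reusing tokenizer and sent from above:
print(len(tokenizer(sent)['input_ids']))  # 18 = 16 sub-word tokens + 2 special tokens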
When you set the sequence length to 9, you are truncating the sentence to:
tokenizer.decode(tokenizer(sent,
                           max_length=9,
                           padding='max_length',
                           truncation=True,
                           return_tensors='pt',
                           add_special_tokens=False
                           )['input_ids'][0])
Output:
'De samenwerking gaat de la'
And as a final proof, the output when you set max_length to 52 is also [[0.99149346 0.00850648]].
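If you want to check this yourself, here is a minimal sketch (reusing model, tokenizer, sent, torch and F from the question; the list of lengths is just an illustrative choice) that runs the same forward pass for a few max_length values at or above the real input length of 18. Since nothing gets truncated there, the attention mask only covers padding and the probabilities come out identical:
for length in [18, 32, 52, 64]:
    tokens = tokenizer(sent, max_length=length, padding='max_length',
                       truncation=True, return_tensors='pt')
    with torch.no_grad():
        logits = model(tokens['input_ids'], tokens['attention_mask'])[0]
    print(length, F.softmax(logits, dim=1).numpy())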