I'm following the Hugging Face doc on calculating the perplexity of fixed-length models. I'm trying to verify that the formula works for various strings, and I'm getting odd behavior. In particular, they mention:
We don’t want the log-likelihood for the tokens we’re just treating as context to be included in our loss, so we can set these targets to -100 so that they are ignored
So, given two different contexts but the same remaining tokens, the formula should return the same perplexity. However, it does not:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
context_1 = 'here is some context_1 and some more stuff'
context_2 = 'here is some context and some more stuff and more stuff aspodkaspd'
answer_1 = 'this is not the answer'
input_ids_wrong = tokenizer(context_1 + answer_1, return_tensors="pt").input_ids
input_ids_correct = tokenizer(context_2 + answer_1, return_tensors="pt").input_ids
context_1_tokens_length = len(tokenizer(context_1, return_tensors="pt").input_ids[0])
context_2_tokens_length = len(tokenizer(context_2, return_tensors="pt").input_ids[0])
target_ids_wrong = input_ids_wrong.clone()
target_ids_correct = input_ids_correct.clone()
target_ids_wrong[:, :context_1_tokens_length] = -100
target_ids_correct[:, :context_2_tokens_length] = -100
print('target_ids_wrong', target_ids_wrong)
print('target_ids_correct', target_ids_correct)
with torch.no_grad():
    outputs_wrong = model(input_ids_wrong, labels=target_ids_wrong)
    outputs_correct = model(input_ids_correct, labels=target_ids_correct)
neg_log_likelihood_wrong = outputs_wrong.loss
neg_log_likelihood_correct = outputs_correct.loss
ppl_wrong = torch.exp(neg_log_likelihood_wrong)
ppl_correct = torch.exp(neg_log_likelihood_correct)
print('ppl_wrong', ppl_wrong)
print('ppl_correct', ppl_correct)
Output:
target_ids_wrong tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 19,
59, 8, 1525, 1]])
target_ids_correct tensor([[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, -100, 19, 59, 8, 1525, 1]])
ppl_wrong tensor(9.0377)
ppl_correct tensor(21.1208)
I tried this with other models as well (e.g., gpt2 and sshleifer/tiny-gpt2) and got the same odd behavior. From the T5 doc they wrote:
we must make sure that padding token id's of the labels are not taken into account by the loss function. In PyTorch and Tensorflow, this can be done by replacing them with -100, which is the ignore_index of the CrossEntropyLoss
So I don't understand why it takes the pad token into account. They also wrote in the same link "which for T5 is equal to 0 (i.e. the id of the pad token)", so I tried replacing the -100 with 0 and actually got a different perplexity score (the two scores are still different from each other, but also different from the ones with -100). This makes me think they don't actually ignore the -100 tokens for some reason.
Am I missing something?
I think you might not be thinking of "ignore the context" in the same way that they are. When they say they want the context to be ignored, they effectively mean they want to compute the log probs for the answer conditioned on the context; e.g., they want something like P(answer|context_1) and P(answer|context_2) instead of P(context_1 + answer) or P(context_2 + answer). If you want to ignore the context entirely, that would be P(answer) - in which case just don't pass the context into the model.
Basically, the probability of the answer SHOULD change when given different contexts - but you want only the conditional probability of the answer given the context, not the joint probability of the answer and the context. You want "How likely is this answer given this context?", not "How likely am I to see this context and answer in general?".
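To make that concrete, here's a minimal sketch of the two quantities using gpt2 (one of the models you tried); the strings and variable names are just illustrative, not from your example:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
context = "here is some context "
answer = "this is not the answer"
# P(answer | context): feed context + answer, but mask the context positions out of the loss
input_ids = tokenizer(context + answer, return_tensors="pt").input_ids
context_length = len(tokenizer(context, return_tensors="pt").input_ids[0])
labels = input_ids.clone()
labels[:, :context_length] = -100  # context tokens contribute nothing to the loss
# P(answer): don't pass the context at all
answer_ids = tokenizer(answer, return_tensors="pt").input_ids
with torch.no_grad():
    conditional_nll = model(input_ids, labels=labels).loss         # mean NLL over the answer tokens, conditioned on the context
    unconditional_nll = model(answer_ids, labels=answer_ids).loss  # mean NLL over the answer tokens alone
print(torch.exp(conditional_nll))    # "how likely is this answer given this context?"
print(torch.exp(unconditional_nll))  # "how likely is this answer on its own?"
Changing the context changes the first number - as it should - while only the second one is context-free.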
Lastly, tokens with a value of -100 are ignored by the cross-entropy loss - that's why they're used, and why you get a different value if you set them to 0.
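You can check this with plain PyTorch - a toy sketch with made-up logits and labels (nothing model-specific), just to show that -100 is the default ignore_index of the cross-entropy loss while 0 is an ordinary token id:
import torch
import torch.nn.functional as F
logits = torch.randn(4, 5)           # 4 positions, vocabulary of 5 "tokens"
labels = torch.tensor([2, 4, 1, 3])
masked = labels.clone()
masked[:2] = -100                     # drop the first two positions from the loss
loss_masked = F.cross_entropy(logits, masked)             # default ignore_index=-100
loss_tail_only = F.cross_entropy(logits[2:], labels[2:])  # same thing, computed by hand
print(torch.allclose(loss_masked, loss_tail_only))        # True
masked[:2] = 0                        # 0 is a real token id, so these positions ARE scored
print(F.cross_entropy(logits, masked))                    # a different loss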