I'm running fine-tuned BERT and ALBERT models for Question Answering, and I'm evaluating their performance on a subset of questions from SQuAD v2.0. I use SQuAD's official evaluation script for the evaluation.
I use Hugging Face transformers. Below is the actual code and example I'm running (it might also be helpful for people who are trying to run a fine-tuned ALBERT model on SQuAD v2.0):
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("ktrapeznikov/albert-xlarge-v2-squad-v2")
model = AutoModelForQuestionAnswering.from_pretrained("ktrapeznikov/albert-xlarge-v2-squad-v2")
question = "Why aren't the examples of bouregois architecture visible today?"
text = """Exceptional examples of the bourgeois architecture of the later periods were not restored by the communist authorities after the war (like mentioned Kronenberg Palace and Insurance Company Rosja building) or they were rebuilt in socialist realism style (like Warsaw Philharmony edifice originally inspired by Palais Garnier in Paris). Despite that the Warsaw University of Technology building (1899\u20131902) is the most interesting of the late 19th-century architecture. Some 19th-century buildings in the Praga district (the Vistula\u2019s right bank) have been restored although many have been poorly maintained. Warsaw\u2019s municipal government authorities have decided to rebuild the Saxon Palace and the Br\u00fchl Palace, the most distinctive buildings in prewar Warsaw."""
input_dict = tokenizer.encode_plus(question, text, return_tensors="pt")
input_ids = input_dict["input_ids"].tolist()
start_scores, end_scores = model(**input_dict)
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1]).replace('▁', '')
print(answer)
The output is the following:
[CLS] why aren ' t the examples of bour ego is architecture visible today ? [SEP] exceptional examples of the bourgeois architecture of the later periods were not restored by the communist authorities after the war
As you can see, the answer contains BERT's special tokens, including [CLS] and [SEP].
I understand that when the answer is just [CLS] (the argmax of both start_scores and end_scores is tensor(0)), it basically means the model thinks the context contains no answer to the question, which makes sense. In these cases I simply set the answer to an empty string when running the evaluation script.
But I wonder: in cases like the example above, should I again assume that the model could not find an answer and set the answer to an empty string? Or should I leave the answer as it is when I'm evaluating the model's performance?
I'm asking because, as far as I understand, the performance calculated by the evaluation script can change (correct me if I'm wrong) if such cases are kept as answers, and I may not get a realistic sense of how these models perform.
You should simply treat them as invalid, because you are trying to predict a proper answer span from the variable text. Everything else should be invalid. This is also how Hugging Face treats these predictions:
We could hypothetically create invalid predictions, e.g., predict that the start of the span is in the question. We throw out all invalid predictions.
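To make that rule concrete, here is a minimal sketch in plain Python of picking the best span that lies entirely inside the context and comparing it with the "no answer" score at the [CLS] position. It is not the library's implementation: best_valid_span, predict_span, the toy logits and the token positions are all made up for illustration, and the null comparison only follows the general idea of a score-difference threshold.

```python
# Toy layout (hypothetical): [CLS] q1 q2 [SEP] c1 c2 c3 [SEP]
#                index:        0    1  2   3    4  5  6   7
start_logits = [1.0, 5.0, 0.1, 0.0, 2.0, 3.0, 0.5, 0.0]
end_logits = [1.0, 0.2, 0.1, 0.0, 0.5, 1.0, 4.0, 0.0]
context_start, context_end = 4, 6  # inclusive token range of the context


def best_valid_span(start_logits, end_logits, context_start, context_end,
                    max_answer_len=30):
    """Return (start, end, score) of the best span inside the context, or None."""
    best = None
    for s in range(context_start, context_end + 1):
        # The end must not precede the start and must stay inside the context.
        for e in range(s, min(s + max_answer_len, context_end + 1)):
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


def predict_span(start_logits, end_logits, context_start, context_end,
                 null_threshold=0.0):
    """Return (start, end), or None when "no answer" scores higher."""
    null_score = start_logits[0] + end_logits[0]  # score of the [CLS] span
    best = best_valid_span(start_logits, end_logits, context_start, context_end)
    if best is None or null_score - best[2] > null_threshold:
        return None
    return best[0], best[1]


# A plain argmax picks a start inside the question (index 1), which is invalid.
naive_start = max(range(len(start_logits)), key=start_logits.__getitem__)
span = predict_span(start_logits, end_logits, context_start, context_end)
print(naive_start, span)
```

With these toy numbers the plain argmax starts the answer in the question, while the constrained search returns a span that lies fully inside the context. A None result would be written to the predictions file as an empty string.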
You should also note that they use a more sophisticated method to get the predictions for each question (don't ask me why they show torch.argmax in their example). Please have a look at the example below:
from transformers.data.processors.squad import SquadResult, SquadExample, SquadFeatures, SquadV2Processor, squad_convert_examples_to_features
from transformers.data.metrics.squad_metrics import compute_predictions_logits, squad_evaluate
###
#your example code
###
outputs = model(**input_dict)
def to_list(tensor):
    return tensor.detach().cpu().tolist()
output = [to_list(output[0]) for output in outputs]
start_logits, end_logits = output
all_results = []
all_results.append(SquadResult(1000000000, start_logits, end_logits))
#this is the answers section from the evaluation dataset
answers = [{'text':'not restored by the communist authorities', 'answer_start':77}, {'text':'were not restored', 'answer_start':72}, {'text':'not restored by the communist authorities after the war', 'answer_start':77}]
examples = [SquadExample('0', question, text, 'not restored by the communist authorities', 75, 'Warsaw', answers,False)]
#this does basically the same as tokenizer.encode_plus() but stores them in a SquadFeatures object and splits if necessary
#positional arguments: max_seq_length=512, doc_stride=100, max_query_length=64, is_training=True
features = squad_convert_examples_to_features(examples, tokenizer, 512, 100, 64, True)
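predictions = compute_predictions_logits(
    examples,
    features,
    all_results,
    20,                    # n_best_size
    30,                    # max_answer_length
    True,                  # do_lower_case
    'pred.file',           # output_prediction_file
    'nbest_file',          # output_nbest_file
    'null_log_odds_file',  # output_null_log_odds_file
    False,                 # verbose_logging
    True,                  # version_2_with_negative
    0.0,                   # null_score_diff_threshold
    tokenizer
)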
result = squad_evaluate(examples, predictions)
print(predictions)
for x in result.items():
    print(x)
Output:
OrderedDict([('0', 'communist authorities after the war')])
('exact', 0.0)
('f1', 72.72727272727273)
('total', 1)
('HasAns_exact', 0.0)
('HasAns_f1', 72.72727272727273)
('HasAns_total', 1)
('best_exact', 0.0)
('best_exact_thresh', 0.0)
('best_f1', 72.72727272727273)
('best_f1_thresh', 0.0)
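To see where the f1 of 72.73 comes from, here is a small sketch of SQuAD-style token-level F1. It is a simplified re-implementation written for this example (lowercase, strip punctuation, drop the articles a/an/the, then compare token bags), not the official script itself, and normalize_answer and token_f1 are names chosen here. The score for a question is the maximum over its gold answers.

```python
import re
import string
from collections import Counter


def normalize_answer(s):
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def token_f1(gold, pred):
    """Token-overlap F1 between one gold answer and one prediction."""
    gold_toks = normalize_answer(gold).split()
    pred_toks = normalize_answer(pred).split()
    if not gold_toks or not pred_toks:
        # Both empty counts as a match (the "no answer" case).
        return float(gold_toks == pred_toks)
    num_same = sum((Counter(gold_toks) & Counter(pred_toks)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


golds = [
    "not restored by the communist authorities",
    "were not restored",
    "not restored by the communist authorities after the war",
]
prediction = "communist authorities after the war"

best_f1 = max(token_f1(g, prediction) for g in golds)
exact = max(float(normalize_answer(g) == normalize_answer(prediction)) for g in golds)
print(round(100 * best_f1, 2), exact)
```

Against the third gold answer, the prediction shares 4 of its 4 normalized tokens (precision 1) and covers 4 of the gold's 7 tokens (recall 4/7), which gives F1 = 8/11, about 72.73. No gold answer matches exactly after normalization, so exact is 0.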