AllenNLP 2.0: Can't get FBetaMultiLabelMeasure to run

I would like to compute the f1-score for a classifier trained with allen-nlp. I used the working code from a allen-nlp guide, which computed accuracy, not F1, so I tried to adjust the metric in the code.

According to the documentation, CategoricalAccuracy and FBetaMultiLabelMeasure take the same inputs. (predictions: torch.Tensor of shape [batch_size, ..., num_classes], gold_labels: torch.Tensor of shape [batch_size, ...])

But for some reason the input that worked perfectly well for the accuracy results in a RuntimeError when given to the f1-multi-label metric.

I condensed the problem to the following code snippet:

>>> from allennlp.training.metrics import CategoricalAccuracy, FBetaMultiLabelMeasure
>>> import torch
>>> labels = torch.LongTensor([0, 0, 2, 1, 0])
>>> logits = torch.FloatTensor([[ 0.0063, -0.0118,  0.1857], [ 0.0013, -0.0217,  0.0356], [-0.0028, -0.0512,  0.0253], [-0.0460, -0.0347,  0.0400], [-0.0418,  0.0254,  0.1001]])
>>> labels.shape
torch.Size([5])
>>> logits.shape
torch.Size([5, 3])
>>> ca = CategoricalAccuracy()
>>> f1 = FBetaMultiLabelMeasure()
>>> ca(logits, labels)
>>> f1(logits, labels)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../lib/python3.8/site-packages/allennlp/training/metrics/fbeta_multi_label_measure.py", line 130, in __call__
true_positives = (gold_labels * threshold_predictions).bool() & mask & pred_mask
RuntimeError: The size of tensor a (5) must match the size of tensor b (3) at non-singleton dimension 1

Why is this error happening? What am I missing here?

Solution

You want to use FBetaMeasure, not FBetaMultiLabelMeasure. "Multilabel" means you can specify more than one correct answer, but "Categorical Accuracy" only allows one correct answer. That means you have to specify another dimension in your labels.

I suspect the documentation of FBetaMultiLabelMeasure is misleading. I'll look into fixing it.