Tags: machine-learning, deep-learning, classification, bert-language-model, multiclass-classification

Calculating Probability of a Classification Model Prediction


I have a classification task. The training data has 50 different labels. The customer wants to differentiate low-probability predictions, meaning that I have to classify some test data as Unclassified / Other depending on the probability (certainty?) of the model.

When I test my code, the prediction result is a numpy array (I'm using different models; this one is a pre-trained BertTransformer). The prediction array doesn't contain probabilities like those returned by the Keras predict_proba() method; these are the raw numbers generated by the prediction method of the pretrained BertTransformer model.

[[-1.7862008  -0.7037363   0.09885322  1.5318055   2.1137428  -0.2216074
   0.18905772 -0.32575375  1.0748093  -0.06001111  0.01083148  0.47495762
   0.27160102  0.13852511 -0.68440574  0.6773654  -2.2712054  -0.2864312
  -0.8428862  -2.1132915  -1.0157436  -1.0340284  -0.35126117 -1.0333195
   9.149789   -0.21288703  0.11455813 -0.32903734  0.10503325 -0.3004114
  -1.3854568  -0.01692022 -0.4388664  -0.42163098 -0.09182278 -0.28269592
  -0.33082992 -1.147654   -0.6703184   0.33038092 -0.50087476  1.1643585
   0.96983343  1.3400391   1.0692116  -0.7623776  -0.6083422  -0.91371405
   0.10002492]]

I'm using numpy.argmax() to identify the correct label. The prediction works just fine. However, since these are not probabilities, I cannot compare the best result with a threshold value.
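
Roughly, the label selection currently looks like this (a sketch; the shortened array stands in for the full output above):

import numpy as np

# Shortened stand-in for the (1, num_labels) prediction array shown above.
predictions = np.array([[-1.7862008, -0.7037363, 0.09885322, 9.149789]])

# The index of the highest score is taken as the predicted label.
predicted_label = np.argmax(predictions, axis=-1)[0]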

My question is: how can I define a threshold (say, 0.6) and then compare the probability of the argmax() element of the BertTransformer prediction array against it, so that I can classify the prediction as "Other" if its probability is below the threshold?

Edit 1:

We are using two different models. One is Keras, and the other is BertTransformer. We have no problem with the Keras model since it gives probabilities, so I'm skipping it here.

The Bert model is pretrained. Here is how it is generated:

from transformers import BertForSequenceClassification

def model(self, data):
    # One output label per encoded category in the training data.
    number_of_categories = len(data['encoded_categories'].unique())
    model = BertForSequenceClassification.from_pretrained(
        "dbmdz/bert-base-turkish-128k-uncased",
        num_labels=number_of_categories,
        output_attentions=False,
        output_hidden_states=False,
    )

    # model.cuda()

    return model

The output given above is the result of the model.predict() method. We compared both models; Bert is slightly ahead, so we know that the prediction works just fine. However, we are not sure what those numbers signify or represent.

Here is the Bert documentation.


Solution

  • BertForSequenceClassification returns logits, i.e., the classification scores before normalization. You can normalize the scores by calling F.softmax(output, dim=-1), where torch.nn.functional has been imported as F.
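
    A minimal sketch of the whole thresholding step, assuming the raw scores arrive as a numpy array like the one shown in the question (the names raw_scores, THRESHOLD, and the "Other" label are illustrative, not part of any API):

    import numpy as np
    import torch
    import torch.nn.functional as F

    # Raw logits from the model, shape (1, num_labels); shortened for readability.
    raw_scores = np.array([[-1.7862008, -0.7037363, 0.09885322, 1.5318055, 9.149789]])

    # Normalize the logits into probabilities that sum to 1 per example.
    probs = F.softmax(torch.from_numpy(raw_scores), dim=-1)

    # Highest probability and the label it belongs to.
    confidence, label = torch.max(probs, dim=-1)

    THRESHOLD = 0.6  # illustrative cut-off
    prediction = "Other" if confidence.item() < THRESHOLD else label.item()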

    With thousands of labels, the normalization can be costly, and you do not need it when you are only interested in the argmax. This is probably why the models return only the raw scores.
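
    Since softmax is strictly increasing, it never changes which entry is largest, so the argmax over the raw logits matches the argmax over the probabilities, e.g.:

    import torch
    import torch.nn.functional as F

    logits = torch.randn(1, 50)  # random scores for 50 labels

    # softmax preserves the ordering of the scores, so the argmax is identical.
    assert torch.equal(logits.argmax(dim=-1),
                       F.softmax(logits, dim=-1).argmax(dim=-1))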