python, nlp, huggingface-transformers

Hugging Face: return probabilities and class labels from Trainer.predict


Is there any way to return probabilities and the predicted class using Trainer.predict?

I checked the documentation but couldn't figure it out. As of now it seems to return only logits.

Obviously both the probabilities and the predicted class could be computed with additional code, but I am wondering whether there is a prebuilt method that does the same.

My current output is shown below:

new_predictions = trainer.predict(dataset_for_future_predicition_after_tokenizer)

new_predictions


PredictionOutput(predictions=array([[-0.43005577,  3.646306  , -0.8073783 , -1.0651836 , -1.3480505 ,
        -1.108013  ],
       [ 3.5415223 , -0.8513837 , -1.8553216 , -0.18011567, -0.35627165,
        -1.8364134 ],
       [-1.0167522 , -0.8911268 , -1.7115675 ,  0.01204597,  1.7177908 ,
         1.0401527 ],
       [-0.82407415, -0.46043932, -1.089274  ,  2.6252217 ,  0.33935028,
        -1.3623345 ]], dtype=float32), label_ids=None, metrics={'test_runtime': 0.0182, 'test_samples_per_second': 219.931, 'test_steps_per_second': 54.983})

Solution

  • As you mentioned, Trainer.predict returns the raw output of the model, which is the logits.

    If you want the labels and a score for every class, I recommend using the pipeline that matches your model's task (TextClassificationPipeline, TokenClassificationPipeline, etc.). The text classification pipeline accepts a return_all_scores parameter in its __call__ method that returns the scores for all labels of a prediction. As far as I know, recent transformers releases deprecate this parameter in favor of top_k=None.

    Here's an example:

    from transformers import TextClassificationPipeline, AutoTokenizer, AutoModelForSequenceClassification
    
    MODEL_NAME = "..."
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    
    pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)
    prediction = pipe("The text to predict", return_all_scores=True)
    

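    The prediction variable will look something like this (depending on the transformers version, the list may be wrapped in an outer list with one entry per input text):

    [{'label': 'LABEL1', 'score': 0.80}, {'label': 'LABEL2', 'score': 0.15}, {'label': 'LABEL3', 'score': 0.05}]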
    

    The label names can be set in the model's config.json file or when creating the model (before training it) by passing the id2label and label2id parameters:

    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=num_labels,
        label2id={"Greeting": 0, "Help": 1, "Farewell": 2},
        id2label={0: "Greeting", 1: "Help", 2: "Farewell"},
    )
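
If you would rather keep using Trainer.predict, the "additional coding" mentioned in the question is short. The following is a minimal sketch, not a built-in transformers feature: it applies a softmax to the logits from the question's output to get probabilities, takes the argmax to get the class ids, and maps the ids to names. The softmax helper and the id2label mapping with generic LABEL_i names are assumptions made for illustration; in practice you would read the logits from new_predictions.predictions and the mapping from model.config.id2label. It also assumes single-label classification (a multi-label model would typically use a sigmoid instead).

```python
import numpy as np

# Logits copied from the PredictionOutput in the question (4 examples, 6 classes).
# In practice: logits = new_predictions.predictions
logits = np.array([
    [-0.43005577, 3.646306, -0.8073783, -1.0651836, -1.3480505, -1.108013],
    [3.5415223, -0.8513837, -1.8553216, -0.18011567, -0.35627165, -1.8364134],
    [-1.0167522, -0.8911268, -1.7115675, 0.01204597, 1.7177908, 1.0401527],
    [-0.82407415, -0.46043932, -1.089274, 2.6252217, 0.33935028, -1.3623345],
], dtype=np.float32)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis (illustrative helper)."""
    shifted = x - x.max(axis=axis, keepdims=True)  # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


# One probability per class for each example; every row sums to 1.
probabilities = softmax(logits)

# Index of the most likely class for each example.
predicted_ids = probabilities.argmax(axis=-1)  # [1, 0, 4, 3] for these logits

# Hypothetical mapping; a real one would come from model.config.id2label.
id2label = {i: f"LABEL_{i}" for i in range(logits.shape[1])}
predicted_labels = [id2label[int(i)] for i in predicted_ids]

for label, probs in zip(predicted_labels, probabilities):
    print(label, round(float(probs.max()), 3))
```

The argmax can be taken on the logits directly, since softmax preserves ordering; the softmax is only needed when you want the probabilities themselves.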