python machine-learning scikit-learn logistic-regression multiclass-classification

How to get probabilities along with classification in LogisticRegression?

I am using the logistic regression algorithm for multi-class text classification. I need a way to get a confidence score along with the predicted category. For example, if I pass text = "Hello this is sample text" to the model, I should get predicted class = Class A and confidence = 80% as a result.

Solution

For most classifiers in scikit-learn, we can get the probability estimates for the classes through predict_proba. For a multi-class LogisticRegression these are the normalized outputs of the model (a softmax over the decision scores in the default multinomial setting), and the resulting classification is obtained by selecting the class with the highest score, i.e. an argmax is applied on the output. If we look at the implementation of predict, we can see that it is essentially doing:

def predict(self, X):
    # decision func on input array
    scores = self.decision_function(X)
    # column indices of max values per row
    indices = scores.argmax(axis=1)
    # index class array using indices
    return self.classes_[indices]

When predict_proba is called rather than predict, the decision scores are converted into probabilities (each row sums to 1) and returned instead of the class labels. Here's an example use case training a LogisticRegression:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred_prob = lr.predict_proba(X_test)

y_pred_prob
array([[1.06906558e-02, 9.02308167e-01, 8.70011771e-02],
       [2.57953117e-06, 7.88832490e-03, 9.92109096e-01],
       [2.66690975e-05, 6.73454730e-02, 9.32627858e-01],
       [9.88612145e-01, 1.13878133e-02, 4.12714660e-08],
       ...
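
As a sanity check on the relationship between decision_function, predict_proba and predict, here is a minimal sketch. It assumes the default multinomial behaviour of recent scikit-learn versions (a one-vs-rest configuration computes the probabilities differently), and the names clf, scores and proba, as well as random_state=0 and max_iter=1000, are choices made for this illustration only.

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# random_state fixed only so the sketch is reproducible (an assumption of this example)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

scores = clf.decision_function(X_test)  # raw linear scores, one column per class
proba = clf.predict_proba(X_test)       # probabilities, one column per class

# Each row of probabilities sums to 1
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)
# In the multinomial case the probabilities are the softmax of the scores
matches_softmax = np.allclose(proba, softmax(scores, axis=1))
# predict() agrees with taking the argmax of the probabilities
same_argmax = np.array_equal(clf.classes_[proba.argmax(axis=1)], clf.predict(X_test))

print(rows_sum_to_one, matches_softmax, same_argmax)
```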

And we can obtain the predicted classes by taking the argmax, as mentioned, and indexing the array of classes:

classes = load_iris().target_names
# column index of the highest probability in each row
indices = y_pred_prob.argmax(axis=1)
classes[indices]
array(['virginica', 'virginica', 'versicolor', 'virginica', 'setosa',
       'versicolor', 'versicolor', 'setosa', 'virginica', 'setosa',...
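
The same idea gives a confidence value for every row at once. The following is a rough sketch rather than a definitive recipe; the variable names (clf, best, labels, confidence) and the fixed random_state are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = clf.predict_proba(X_test)
best = proba.argmax(axis=1)                     # column of the most probable class
labels = iris.target_names[clf.classes_[best]]  # column -> class label -> class name
confidence = proba[np.arange(len(best)), best]  # probability of the chosen class

# Show the first few predictions with their confidence
for name, p in zip(labels[:5], confidence[:5]):
    print(f"{name}: {p:.2%}")
```

Going through clf.classes_ rather than indexing target_names directly keeps the mapping correct even when the class labels are not simply 0, 1, 2 in order.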

So for a single prediction, through the predicted probabilities we could easily do something like:

y_pred_prob = lr.predict_proba(X_test[0,None])
ix = y_pred_prob.argmax(1).item()

print(f'predicted class = {classes[ix]} and confidence = {y_pred_prob[0,ix]:.2%}')
# predicted class = virginica and confidence = 90.75%
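
For the text classification case from the question, the same pattern could be wrapped in a pipeline. This is only a sketch under stated assumptions: the toy corpus, the labels and the helper predict_with_confidence are hypothetical, and TfidfVectorizer is one possible choice of text features, not something the question specifies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, for illustration only
texts = [
    "hello this is a greeting text",
    "hello and welcome to the sample",
    "invoice payment is due this month",
    "please send the payment for the invoice",
    "the match ended with a late goal",
    "the team scored a goal in the match",
]
labels = ["Class A", "Class A", "Class B", "Class B", "Class C", "Class C"]

# Vectorize the text, then fit a multi-class logistic regression on top
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

def predict_with_confidence(model, text):
    """Return (predicted label, probability of that label) for one string."""
    proba = model.predict_proba([text])[0]  # probabilities for the single input
    ix = proba.argmax()                     # index of the most probable class
    return model.classes_[ix], float(proba[ix])

label, confidence = predict_with_confidence(model, "Hello this is sample text")
print(f"predicted class = {label} and confidence = {confidence:.2%}")
```

The value printed as confidence is the model's estimated probability for the chosen class, so it depends on the training data and on how well the model is calibrated.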