python, machine-learning, scikit-learn, logistic-regression, multiclass-classification

How to get probabilities along with classification in LogisticRegression?


I am using logistic regression for multi-class text classification. I need a way to get a confidence score along with the predicted category. For example, if I pass text = "Hello this is sample text" to the model, I should get predicted class = Class A and confidence = 80% as a result.


Solution

  • For most models in scikit-learn, we can get the probability estimates for the classes through predict_proba. For LogisticRegression these are the actual outputs of the logistic function (softmax in the multinomial case). The resulting classification is obtained by selecting the class with the highest score, i.e. an argmax is applied over the classes. Looking at the implementation of predict, it is essentially doing:

    def predict(self, X):
        # decision func on input array
        scores = self.decision_function(X)
        # column indices of max values per row
        indices = scores.argmax(axis=1)
        # index class array using indices
        return self.classes_[indices]
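
    As a quick check of the claim above, the following sketch reproduces predict by hand on the iris data. The max_iter=1000 setting is an assumption added only to avoid a convergence warning; it is not part of the original answer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# max_iter is raised only so the solver converges cleanly on this toy data
lr = LogisticRegression(max_iter=1000).fit(X, y)

scores = lr.decision_function(X)  # raw scores, shape (n_samples, n_classes)
probs = lr.predict_proba(X)       # probabilities, same shape

# Manual version of predict: argmax per row, then index the class array
from_scores = lr.classes_[scores.argmax(axis=1)]
from_probs = lr.classes_[probs.argmax(axis=1)]

same_as_predict = np.array_equal(from_scores, lr.predict(X))
same_argmax = np.array_equal(from_scores, from_probs)
print(same_as_predict, same_argmax)
```

    Because the probabilities are a monotonic transformation of the scores, taking the argmax of either one should pick the same class.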
    

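    Calling predict_proba rather than predict returns the probability of every class instead of only the winning label. These probabilities are computed from the same decision scores, not returned as the raw scores themselves. Here's an example use case training a LogisticRegression: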

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    y_pred_prob = lr.predict_proba(X_test)
    
    y_pred_prob
    array([[1.06906558e-02, 9.02308167e-01, 8.70011771e-02],
           [2.57953117e-06, 7.88832490e-03, 9.92109096e-01],
           [2.66690975e-05, 6.73454730e-02, 9.32627858e-01],
           [9.88612145e-01, 1.13878133e-02, 4.12714660e-08],
           ...
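
    To make the link between the decision scores and these probabilities concrete, here is a minimal sketch. It assumes a recent scikit-learn release, where a multiclass LogisticRegression uses the multinomial formulation by default, so the probabilities are the softmax of the scores. In a one-vs-rest setup they would be normalized sigmoids instead and the softmax check would not hold. The use of scipy.special.softmax and the fixed random_state are choices made for this illustration.

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# random_state is fixed here only to make the sketch repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

scores = lr.decision_function(X_test)
probs = lr.predict_proba(X_test)

# Each row of predict_proba is a distribution over the classes
rows_sum_to_one = np.allclose(probs.sum(axis=1), 1.0)
# Under the multinomial assumption, probabilities are softmax(scores)
matches_softmax = np.allclose(probs, softmax(scores, axis=1))
print(rows_sum_to_one, matches_softmax)
```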
    

    And we can obtain the predicted classes by taking the argmax of each row, as mentioned, and indexing the array of classes:

    indices = y_pred_prob.argmax(axis=1)
    classes = load_iris().target_names
    classes[indices]
    array(['virginica', 'virginica', 'versicolor', 'virginica', 'setosa',
           'versicolor', 'versicolor', 'setosa', 'virginica', 'setosa',...
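
    For a whole batch, the confidence of each prediction is simply the row maximum of the probability array. The sketch below pairs every predicted label with its confidence; the fixed random_state and max_iter are assumptions added for repeatability and are not in the original answer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0  # assumption: fixed seed
)
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

probs = lr.predict_proba(X_test)
indices = probs.argmax(axis=1)
# lr.classes_ maps column positions back to the integer targets
labels = iris.target_names[lr.classes_[indices]]
confidences = probs.max(axis=1)

# Show the first few (label, confidence) pairs
for label, conf in list(zip(labels, confidences))[:5]:
    print(f"{label}: {conf:.2%}")
```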
    

    So for a single prediction, through the predicted probabilities we could easily do something like:

    y_pred_prob = lr.predict_proba(X_test[0,None])
    ix = y_pred_prob.argmax(1).item()
    
    print(f'predicted class = {classes[ix]} and confidence = {y_pred_prob[0,ix]:.2%}')
    # predicted class = virginica and confidence = 90.75%
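
    The same idea carries over to the text classification case in the question. The sketch below is only an illustration: the toy corpus, the label names, the choice of TfidfVectorizer, and the helper predict_with_confidence are all hypothetical and stand in for whatever features and data the real model uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, for illustration only
texts = [
    "where is my package", "my order has not arrived",
    "the app crashes on login", "password reset does not work",
    "great service thank you", "love the new update",
]
labels = ["shipping", "shipping", "technical", "technical", "praise", "praise"]

# Vectorizer and classifier chained so raw strings can be passed directly
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

def predict_with_confidence(model, text):
    """Return (predicted label, probability of that label) for one string."""
    probs = model.predict_proba([text])[0]
    ix = probs.argmax()
    return model.classes_[ix], float(probs[ix])

label, confidence = predict_with_confidence(clf, "Hello this is sample text")
print(f"predicted class = {label} and confidence = {confidence:.2%}")
```

    Indexing with the model's own classes_ attribute, rather than a separately loaded list of names, keeps the label order aligned with the columns of predict_proba.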