python-3.x, scikit-learn, probability, multiclass-classification

Probability difference between a categorical target and a one-hot encoded target using OneVsRestClassifier


I'm a bit confused about the probabilities I get from sklearn's OneVsRestClassifier with a categorical target versus a one-hot encoded target. Take the iris data with a simple logistic regression as an example. When I use the original iris classes [0, 1, 2], the probabilities returned by OneVsRestClassifier() always add up to 1 for each observation. However, if I convert the target to dummies, this is not the case. I understand that OneVsRestClassifier() compares one class against the rest (class 0 vs. non class 0, class 1 vs. non class 1, etc.), so it would make more sense for the sum of these probabilities to have no particular relation to 1. Why, then, do I see this difference, and how does it come about?

import numpy as np
import pandas as pd
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
np.set_printoptions(suppress=True)

iris = datasets.load_iris()
rng = np.random.RandomState(0)
perm = rng.permutation(iris.target.size)
X = iris.data[perm]
y = iris.target[perm]

# categorical target with no conversion

X_train, y_train1 = X[:80], y[:80]
X_test, y_test1 = X[80:], y[80:]

m3 = LogisticRegression(random_state=0)
clf1 = OneVsRestClassifier(m3).fit(X_train, y_train1)
y_pred1 = clf1.predict(X_test)
print(np.sum(y_pred1 == y_test1))  # number of correct test predictions
y_prob1 = clf1.predict_proba(X_test)
y_prob1[:5]

#output
array([[0.00014508, 0.17238549, 0.82746943],
       [0.03850173, 0.79646817, 0.1650301 ],
       [0.73981106, 0.26018067, 0.00000827],
       [0.00016332, 0.32231163, 0.67752505],
       [0.00029197, 0.2495404 , 0.75016763]])

# one hot encoding for categorical target

y2 = pd.get_dummies(y)
y_train2 = y2[:80]
y_test2 = y2[80:]
clf2 = OneVsRestClassifier(m3).fit(X_train, y_train2)
y_pred2 = clf2.predict(X_test)
y_prob2 = clf2.predict_proba(X_test)
y_prob2[:5]

#output
array([[0.00017194, 0.20430011, 0.98066319],
       [0.02152246, 0.44522562, 0.09225181],
       [0.96277892, 0.3385952 , 0.00001076],
       [0.00023024, 0.45436925, 0.95512082],
       [0.00036849, 0.31493725, 0.94676348]])
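
A quick way to see the difference is to compare the row sums of the two probability matrices (the numbers in the comments are just the sums of the rows printed above):

y_prob1.sum(axis=1)[:5]   # every row sums to 1
y_prob2.sum(axis=1)[:5]   # rows sum to ~1.185, 0.559, 1.301, 1.410, 1.262 -- not 1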

Solution

  • When you one-hot encode the targets, sklearn interprets your problem as a multilabel one rather than just multiclass; that is, it treats each point as possibly having more than one true label. In that case it is perfectly acceptable for the probabilities of a point to sum to more (or less) than 1. That's generally true in sklearn, but OneVsRestClassifier calls it out specifically in its docstring:

    OneVsRestClassifier can also be used for multilabel classification. To use this feature, provide an indicator matrix for the target y when calling .fit.
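
    As a quick illustration (not from the original answer), sklearn's own target-type detection should report the two encodings differently; pd.get_dummies produces exactly the kind of indicator matrix the docstring refers to:

    from sklearn.utils.multiclass import type_of_target

    type_of_target(y_train1)   # expected: 'multiclass'           -> OvR, probabilities normalized
    type_of_target(y_train2)   # expected: 'multilabel-indicator' -> independent per-label probabilities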

    As for the first approach: there are indeed three independent binary models, but their predicted probabilities are normalized so that each row sums to 1; see the source code. That normalization is, in fact, the only difference:

    (y_prob2 / y_prob2.sum(axis=1)[:, None] == y_prob1).all()
    
    # output
    True
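
    The normalization can also be reproduced by hand from the fitted binary estimators stored in clf1.estimators_ (a sketch of what predict_proba does internally in the multiclass case; each binary model's column 1 is the probability of "this class vs. the rest"):

    # positive-class probability from each one-vs-rest binary model
    per_class = np.column_stack([est.predict_proba(X_test)[:, 1] for est in clf1.estimators_])
    # dividing each row by its sum should recover the normalized output above
    np.allclose(per_class / per_class.sum(axis=1, keepdims=True), y_prob1)   # expected: True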
    

    It's probably worth pointing out that LogisticRegression also supports multiclass problems natively. In that case the weights for each class are still separate, so it resembles three independent models, but the probabilities come from applying a softmax to the per-class scores, and the loss function is minimized over all classes simultaneously. The resulting coefficients, and hence the predictions, can therefore differ from those obtained with OneVsRestClassifier:

    m3.fit(X_train, y_train1)
    y_prob0 = m3.predict_proba(X_test)
    y_prob0[:5]
    
    # output:
    array([[0.00000494, 0.01381671, 0.98617835],
           [0.02569699, 0.88835451, 0.0859485 ],
           [0.95239985, 0.04759984, 0.00000031],
           [0.00001338, 0.04195642, 0.9580302 ],
           [0.00002815, 0.04230022, 0.95767163]])
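
    For completeness, the multinomial probabilities above are a softmax over the per-class scores, which can be checked directly (assuming the fitted LogisticRegression actually used the multinomial formulation, as the differing output suggests):

    from scipy.special import softmax

    scores = m3.decision_function(X_test)           # raw per-class scores, shape (n_samples, 3)
    np.allclose(softmax(scores, axis=1), y_prob0)   # expected: True for the multinomial case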