python scikit-learn random-forest roc auc

How does sklearn calculate AUC for a random forest, and why do different functions give different results?


I start with the scikit-learn example "ROC Curve with Visualization API":

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
y = y == 2

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rfc = RandomForestClassifier(n_estimators=10, random_state=42)
rfc.fit(X_train, y_train)
ax = plt.gca()
rfc_disp = RocCurveDisplay.from_estimator(rfc, X_test, y_test, ax=ax, alpha=0.8)
print(rfc_disp.roc_auc)

This prints 0.9823232323232323.

Following this immediately by

from sklearn.metrics import roc_auc_score
y_pred = rfc.predict(X_test)
auc = roc_auc_score(y_test, y_pred)
print(auc)

I obtain 0.928030303030303, which is manifestly different.

Interestingly, I obtain the same value (0.928030303030303) from the ROC Curve Visualization API if I pass it the predicted labels:

rfc_disp1 = RocCurveDisplay.from_predictions(y_test, y_pred)
print(rfc_disp1.roc_auc)

However, integrating the first curve (rfc_disp) with the trapezoidal rule reproduces the former result, 0.9823232323232323:

import numpy as np
I = np.sum(np.diff(rfc_disp.fpr) * (rfc_disp.tpr[1:] + rfc_disp.tpr[:-1])/2.)
print(I)

What is the reason for this discrepancy? I assume it is related to how the two functions calculate AUC (perhaps they smooth the curve differently?).

This brings me to a more general question: how is the ROC curve obtained for a random forest in sklearn? Which parameter or threshold is varied to obtain the different predictions? Are these just the scores of the separate trees of the forest?


Solution

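  • Use predict_proba rather than predict when computing AUC. roc_auc_score expects continuous scores (here, the predicted probability of the positive class). With the hard labels returned by predict, the ROC curve has only a single operating point between (0, 0) and (1, 1), hence the lower value. RocCurveDisplay.from_estimator uses predict_proba for this classifier, which is why it reports the larger value.

    Try this: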

    from sklearn.metrics import roc_auc_score
    auc = roc_auc_score(y_test, rfc.predict_proba(X_test)[:, 1])
    print(auc)
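
To address the more general question about where the thresholds come from, here is a minimal sketch that reuses the setup from the question. It compares the ROC curve built from probabilities with the one built from hard labels, and checks two properties of RandomForestClassifier as I understand it: predict_proba is the mean of the per-tree class probabilities, and for two classes predict amounts to thresholding that mean at 0.5. The variable names (proba, hard, tree_mean and so on) are my own, and the exact numbers may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Same setup as in the question.
X, y = load_wine(return_X_y=True)
y = y == 2
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
rfc = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train, y_train)

# Continuous scores: probability of the positive class (column 1 is classes_[1], i.e. True).
proba = rfc.predict_proba(X_test)[:, 1]
# Hard labels, cast to int so they can serve as two-valued "scores".
hard = rfc.predict(X_test).astype(int)

# roc_curve sweeps a threshold over the distinct score values.
fpr_p, tpr_p, thr_p = roc_curve(y_test, proba)
fpr_h, tpr_h, thr_h = roc_curve(y_test, hard)

print("ROC points (probabilities):", len(fpr_p))
print("ROC points (hard labels):  ", len(fpr_h))
print("AUC (probabilities):", auc(fpr_p, tpr_p), roc_auc_score(y_test, proba))
print("AUC (hard labels):  ", auc(fpr_h, tpr_h), roc_auc_score(y_test, hard))

# The forest's probability is the mean of the per-tree probabilities.
tree_mean = np.mean([t.predict_proba(X_test)[:, 1] for t in rfc.estimators_], axis=0)
print("forest proba == mean of tree probas:", np.allclose(proba, tree_mean))

# For two classes, predict() corresponds to thresholding that probability at 0.5.
print("predict == (proba > 0.5):", np.array_equal(proba > 0.5, hard.astype(bool)))
```

So the quantity being thresholded is the averaged probability of the whole forest, not the score of any single tree. With only 10 trees the averaged probability typically takes few distinct values, so the curve has few steps. The hard-label curve collapses to one intermediate point, and its trapezoid area is the smaller value reported in the question.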