Why doesn't a trained RandomForestClassifier with specific parameters match the performance of a GridSearchCV that varies those same parameters?
    def random_forest(X_train, y_train):
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import GridSearchCV, train_test_split
        from sklearn.metrics import roc_auc_score

        # reserve a validation set
        X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train, random_state=0)

        # various combinations of max depth and max features
        grid_values = {'max_depth': [1, 2, 3], 'max_features': [2, 3, 4]}

        # build GridSearch
        clf = RandomForestClassifier(n_estimators=10)
        grid = GridSearchCV(clf, param_grid=grid_values, cv=3, scoring='roc_auc')
        grid.fit(X_train, y_train)
        y_hat_proba = grid.predict_proba(X_validate)
        print('Train Grid best parameter (max. AUC): ', grid.best_params_)
        print('Train Grid best score (AUC): ', grid.best_score_)
        print('Validation set AUC: ', roc_auc_score(y_validate, y_hat_proba[:, 1]))

        # build RandomForest with hard-coded values; AUC should be in the ballpark of the grid search
        clf = RandomForestClassifier(max_depth=3, max_features=4, n_estimators=10)
        clf.fit(X_train, y_train)
        y_hat = clf.predict(X_validate)
        y_hat_prob = clf.predict_proba(X_validate)[:, 1]
        auc = roc_auc_score(y_hat, y_hat_prob)
        print("\nMax Depth: 3 Max Features: 4\n---------------------------------------------")
        print("auc: {}".format(auc))
Results: the grid search identifies the best parameters of max_depth=3 and max_features=4 and calculates a roc_auc_score of 0.85; when I score the reserved validation set with that model I get a roc_auc_score of 0.84. However, when I build the classifier directly with those parameters it calculates a roc_auc_score of 1.0. My understanding is that it should be in the same ballpark (~0.85), but this feels way off.
Validation set AUC: 0.8490471073563559
Grid best parameter (max. AUC): {'max_depth': 3, 'max_features': 4}
Grid best score (AUC): 0.8599727094965482
Max Depth: 3 Max Features: 4
---------------------------------------------
auc: 1.0
I could be misunderstanding concepts, not applying techniques correctly, or even have coding issues. Thanks.
There are two issues:

1. To get reproducible results, specify the seed or random state wherever possible, e.g.

       clf = RandomForestClassifier(n_estimators=10, random_state=1234)
       cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1234)
       grid = GridSearchCV(clf, param_grid=grid_values, cv=cv, scoring='roc_auc')

   (Note that StratifiedKFold only accepts a random_state when shuffle=True.)
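   A minimal end-to-end sketch of a fully seeded grid search; the make_classification data here is a hypothetical stand-in for the real X_train/y_train:

   ```python
   from sklearn.datasets import make_classification
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import GridSearchCV, StratifiedKFold

   # synthetic data standing in for the real training set (assumption)
   X, y = make_classification(n_samples=500, n_features=8, random_state=1234)

   # seed both the estimator and the CV splitter
   clf = RandomForestClassifier(n_estimators=10, random_state=1234)
   cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1234)
   grid = GridSearchCV(clf,
                       param_grid={'max_depth': [1, 2, 3],
                                   'max_features': [2, 3, 4]},
                       cv=cv, scoring='roc_auc')
   grid.fit(X, y)

   # because every source of randomness is seeded, rerunning this block
   # reproduces the exact same best_params_ and best_score_
   print(grid.best_params_, grid.best_score_)
   ```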
2. You score with the estimated labels instead of the true labels:

       auc = roc_auc_score(y_hat, y_hat_prob)

   Since y_hat is just y_hat_prob thresholded at 0.5, the probabilities rank those predicted labels perfectly, so the AUC is trivially 1.0. Use the true labels instead:

       auc = roc_auc_score(y_validate, y_hat_prob)
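To see why scoring predictions against their own probabilities always yields 1.0, here is a minimal sketch with hypothetical probability values:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# hypothetical predicted probabilities from any classifier
y_hat_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.7])

# hard predictions are these probabilities thresholded at 0.5, so every
# "positive" necessarily has a higher score than every "negative"
y_hat = (y_hat_prob >= 0.5).astype(int)

print(roc_auc_score(y_hat, y_hat_prob))  # 1.0 -- labels derived from the scores
```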