I have the code below (using sklearn) that first uses the training set for cross-validation, and for a final check, uses the test set. However, the cross-validation consistently performs better, as shown below. Am I over-fitting on the training data? If so, which hyper parameter(s) would be best to tune to avoid this?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
#Cross validation
rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc' }
scores = cross_validate(rfc, X_train, y_train, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
mean(scores['test_precision']),
mean(scores['test_recall']),
mean(scores['test_f1']),
mean(scores['test_roc_auc'])
)
Which gives me:
0.8536558341101569 0.8641939667622551 0.8392201023654705 0.8514895113569482 0.9264002192260914
Re-train the model now with the entire training+validation set, and test it with never-seen-before test-set
RFC = RandomForestClassifier()
RFC.fit(X_train, y_train)
y_pred = RFC.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
y_pred_proba = RFC.predict_proba(X_test)[::,1]
auc = roc_auc_score(y_test, y_pred_proba)
print(accuracy,
precision,
recall,
f1,
auc
)
Now it gives me the numbers below, which are clearly worse:
0.7809788654060067 0.5113236034222446 0.5044687189672294 0.5078730317420644 0.7589037004728368
I am able to reproduce your scenario with Pima Indians Diabetes Dataset.
The difference you see in the prediction metrics is not consistence and in some runs you may even notice the opposite, because it depends on the selection of the X_test during the split - some of the cases will be easier to predict and will give better metrics and vice versa. While Cross-validation runs predictions on the whole set you have in rotation and aggregates this effect, the single X_test set will suffer from effects of random splits.
In order to have better visibility on what is happening here, I have modified your experiment and split in two steps:
I use the whole of the X and y sets and run rest of the code as it is
rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
# cv = KFold(n_splits=10)
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
scores = cross_validate(rfc, X, y, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
mean(scores['test_precision']),
mean(scores['test_recall']),
mean(scores['test_f1']),
mean(scores['test_roc_auc'])
)
Output:
0.768257006151743 0.6943032069967433 0.593436328663432 0.6357667086829574 0.8221242747913622
Next I run the plain train-test step, but I do it 50 times with the different train_test splits, and average the metrics (similar to Cross-validation step).
accuracies = []
precisions = []
recalls = []
f1s = []
aucs = []
for i in range(50):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
RFC = RandomForestClassifier()
RFC.fit(X_train, y_train)
y_pred = RFC.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
y_pred_proba = RFC.predict_proba(X_test)[::, 1]
auc = roc_auc_score(y_test, y_pred_proba)
accuracies.append(accuracy)
precisions.append(precision)
recalls.append(recall)
f1s.append(f1)
aucs.append(auc)
print(mean(accuracies),
mean(precisions),
mean(recalls),
mean(f1s),
mean(aucs)
)
Output:
0.7606926406926405 0.7001931059992001 0.5778712922956755 0.6306501622080503 0.8207846633339568
As expected the prediction metrics are similar. However, the Cross-validation runs much faster and uses each data point of the whole data set for testing (in rotation) by a given number of times.