python machine-learning scikit-learn cross-validation train-test-split

Why does my cross-validation consistently perform better than train-test split?

I have the code below (using sklearn) that first runs cross-validation on the training set and then, as a final check, evaluates on the test set. However, the cross-validation scores are consistently better, as shown below. Am I overfitting the training data? If so, which hyperparameter(s) would be best to tune to avoid this?

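from numpy import mean  # assumption: the original `mean` is NumPy's
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, cross_validate, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Cross-validation on the training portion only
rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
# Scorer names as a list (the documented form; a set may be rejected by newer scikit-learn versions)
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
scores = cross_validate(rfc, X_train, y_train, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc'])
      )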

Which gives me:

0.8536558341101569 0.8641939667622551 0.8392201023654705 0.8514895113569482 0.9264002192260914

Next, I retrain the model on the entire training set (the training and validation folds together) and evaluate it on the never-before-seen test set:

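from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

RFC = RandomForestClassifier()

# Fit on the full training set, then predict the held-out test set
RFC.fit(X_train, y_train)
y_pred = RFC.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
# ROC AUC needs the predicted probability of the positive class (column 1)
y_pred_proba = RFC.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)

print(accuracy,
      precision,
      recall,
      f1,
      auc
      )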

Now it gives me the numbers below, which are clearly worse:

0.7809788654060067 0.5113236034222446 0.5044687189672294 0.5078730317420644 0.7589037004728368

Solution

I am able to reproduce your scenario with the Pima Indians Diabetes dataset.

The difference you see in the prediction metrics is not consistent; in some runs you may even notice the opposite. It depends on which samples end up in X_test during the split: some cases are easier to predict and give better metrics, and vice versa. Cross-validation runs predictions on every part of the data it is given, in rotation, and averages this effect out, whereas a single X_test set is fully exposed to the randomness of one split.
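To see how much a single split can move the score, here is a minimal sketch that repeats the split with different seeds and reports the spread. It uses synthetic data from make_classification as a stand-in (an assumption, not the data from the question), and the sample size, number of trees, and number of seeds are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the dataset from the question).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

single_split_accuracies = []
for seed in range(20):
    # Only the split changes between iterations; the model seed stays fixed,
    # so the spread below comes from the choice of test set alone.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    single_split_accuracies.append(accuracy_score(y_test, model.predict(X_test)))

single_split_accuracies = np.array(single_split_accuracies)
print("min :", single_split_accuracies.min())
print("max :", single_split_accuracies.max())
print("mean:", single_split_accuracies.mean())
print("std :", single_split_accuracies.std())
```

Any one of these 20 numbers is what a single train-test run would have reported; the gap between min and max gives a rough idea of how far one run can land from the average.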

To get better visibility into what is happening here, I modified your experiment and split it into two steps:

1. Cross-validation step:

I use the whole of X and y and run the rest of the code as it is:

rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
# cv = KFold(n_splits=10)
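# Scorer names as a list (the documented form; a set may be rejected by newer scikit-learn versions)
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']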
scores = cross_validate(rfc, X, y, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc'])
      )

Output:

0.768257006151743 0.6943032069967433 0.593436328663432 0.6357667086829574 0.8221242747913622
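To make "in rotation" concrete, the sketch below counts how often each sample ends up in a test fold under the same RepeatedKFold settings. The dataset size of 100 and the dummy feature matrix are hypothetical; only the indices matter here, so no model is trained.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

n_samples = 100  # hypothetical dataset size, chosen only for illustration
X_dummy = np.zeros((n_samples, 1))  # the splitter only needs the number of rows

cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)

# Count how many times each sample index is used for testing.
test_counts = np.zeros(n_samples, dtype=int)
for train_idx, test_idx in cv.split(X_dummy):
    test_counts[test_idx] += 1

print("total fits:", cv.get_n_splits())          # n_splits * n_repeats
print("test appearances:", np.unique(test_counts))  # one value: n_repeats
```

Every sample is tested exactly once per repeat, so the averaged score does not hinge on which samples happen to fall into one particular test set.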

2. Classic train-test step:

Next, I run the plain train-test step, but I repeat it 50 times with different random train-test splits and average the metrics (similar to the cross-validation step).

accuracies = []
precisions = []
recalls = []
f1s = []
aucs = []

for i in range(50):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    RFC = RandomForestClassifier()

    RFC.fit(X_train, y_train)
    y_pred = RFC.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    y_pred_proba = RFC.predict_proba(X_test)[:, 1]  # probability of the positive class
    auc = roc_auc_score(y_test, y_pred_proba)
    accuracies.append(accuracy)
    precisions.append(precision)
    recalls.append(recall)
    f1s.append(f1)
    aucs.append(auc)

print(mean(accuracies),
      mean(precisions),
      mean(recalls),
      mean(f1s),
      mean(aucs)
      )

Output:

0.7606926406926405 0.7001931059992001 0.5778712922956755 0.6306501622080503 0.8207846633339568

As expected, the prediction metrics are similar. However, cross-validation runs much faster and uses each data point of the whole dataset for testing (in rotation) a fixed number of times.
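As a side note, the repeated train-test loop from step 2 can also be expressed through cross_validate by passing a ShuffleSplit splitter, which draws independent random train/test partitions much like repeated calls to train_test_split. The following is a sketch under assumptions: the data is synthetic (make_classification), and the sample size and n_estimators=25 are arbitrary values chosen to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic stand-in data (assumption: not the dataset used above).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 50 independent random splits, each holding out 30% of the data for testing.
cv = ShuffleSplit(n_splits=50, test_size=0.3, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

scores = cross_validate(
    RandomForestClassifier(n_estimators=25), X, y, scoring=scoring, cv=cv
)

# Average each metric over the 50 splits.
for name in scoring:
    print(name, np.mean(scores[f"test_{name}"]))
```

Unlike k-fold, ShuffleSplit does not guarantee that every sample is tested the same number of times, which is the practical difference between the two steps compared above.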