Tags: python, machine-learning, data-science, cross-validation, hyperparameters

Why is my mean test score from parameter tuning (CV) lower than on the hold-out test set (RandomForestClassifier)?


I'm doing hyperparameter tuning with RandomizedSearchCV (sklearn) using 3-fold cross-validation on my training set. Afterwards I check my scores (accuracy, recall_weighted, cohen_kappa) on the test set. Surprisingly, they are always a bit higher than the best_score_ attribute of my RandomizedSearchCV.

At the start I perform a stratified 70/30 split into training and test sets.

My dataset includes 12 classes, 12 features and is imbalanced. I have ~3k datapoints.

Is this difference normal (or at least not too surprising) when comparing the cross-validation score from parameter tuning with the score on a hold-out test set?

I already tried different random seeds for the initial split and different scoring methods (accuracy, recall_macro, recall_weighted, cohen_kappa).

Here is my code:

# Imports needed to run this snippet
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score, make_scorer, recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Split data into training and test sets (70/30 stratified split)
x_train, x_test, y_train, y_test = train_test_split(X_Distances, Y, test_size=0.3, random_state=42, stratify=Y, shuffle=True)

# Scorers used to evaluate the parameter search
scoring = {'Accuracy': make_scorer(accuracy_score), 'Recall': 'recall_weighted', 'Kappa': make_scorer(cohen_kappa_score)}

# Initialize parameter ranges for the random search
params_randomSearch = {"min_samples_leaf": np.arange(1, 30, 2),
                       "min_samples_split": np.arange(2, 20, 2),
                       "max_depth": np.arange(2, 20, 2),
                       "min_weight_fraction_leaf": np.arange(0., 0.4, 0.1),
                       "n_estimators": np.arange(10, 1000, 100),
                       "max_features": ['auto', 'sqrt', 'log2', None],
                       "criterion": ['entropy', 'gini']}

# Perform RandomizedSearchCV over a wide range of possible parameters
if __name__ == '__main__':
    rs = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_distributions=params_randomSearch, scoring=scoring, cv=3, refit='Recall', n_iter=60, n_jobs=-1, random_state=42)
    rs.fit(x_train, y_train)
    print('Best Score: ', rs.best_score_, '\nBest parameters: ', rs.best_params_)
    # Evaluate the refit best estimator on the hold-out test set
    y_predict = rs.best_estimator_.predict(x_test)
    test_recall = recall_score(y_test, y_predict, average='weighted')
    print('Hold-out recall_weighted: ', test_recall)

Results for recall_weighted:

# RandomizedSearchCV:
best_params_ = {'n_estimators': 310, 'min_weight_fraction_leaf': 0.0, 'min_samples_split': 12, 'min_samples_leaf': 5, 'max_features': 'auto', 'max_depth': 14, 'criterion': 'entropy'}
best_score_ = 0.5103216514642342

# Hold-out test set:
0.5666293393057111

I want to use the hold-out test set to compare how different algorithms work on this data set.

Question: Is there an error in my approach that leads to this difference in score, or can I ignore it? How should I interpret it?


Solution

  • As far as I can see, everything is as expected.

    best_score_ gives you the average score across the 3 folds for the best estimator:

    Each fold is trained on only ~1,386 samples: ~3,000 * 0.7 (train size) * 2/3 (CV train fraction).

    The best estimator is then refit (this is what the refit parameter of RandomizedSearchCV does) on the entire x_train, which has ~2,100 samples (3,000 * 0.7), so it sees considerably more data.

    You can try, for example, cv=5 for your search; you will probably see the score difference shrink. The sketch at the end of this answer illustrates both points.

    Also, the more data you have, the more representative the CV score becomes. Maybe for this particular project 3,000 samples is simply not quite enough.
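
The sketch below is a minimal, self-contained illustration of these two points, not a reproduction of the original experiment: it assumes a synthetic make_classification dataset as a stand-in for X_Distances / Y and a trimmed parameter grid, so the exact numbers will differ. It prints the per-fold validation scores behind best_score_ together with the hold-out weighted recall of the refit estimator, once for cv=3 and once for cv=5.

# A minimal sketch, not the asker's exact setup: synthetic make_classification data
# stands in for X_Distances / Y and the parameter grid is trimmed, so numbers differ.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Roughly the setup from the question: ~3k samples, 12 features, 12 classes
X, y = make_classification(n_samples=3000, n_features=12, n_informative=10,
                           n_classes=12, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y, shuffle=True)

params = {"max_depth": np.arange(2, 20, 2),
          "n_estimators": np.arange(10, 310, 100)}

for cv in (3, 5):
    rs = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions=params, scoring='recall_weighted',
                            cv=cv, n_iter=10, n_jobs=-1, random_state=42)
    rs.fit(x_train, y_train)

    # best_score_ is just the mean of the per-fold validation scores of the best
    # candidate; each fold was trained on only (cv-1)/cv of x_train, whereas the
    # refit best_estimator_ was trained on all of x_train.
    best = rs.best_index_
    fold_scores = [rs.cv_results_['split%d_test_score' % i][best] for i in range(cv)]
    holdout = recall_score(y_test, rs.best_estimator_.predict(x_test), average='weighted')
    print('cv=%d: per-fold val recall %s, best_score_ %.3f, hold-out recall %.3f'
          % (cv, np.round(fold_scores, 3), rs.best_score_, holdout))

With cv=5 each fold trains on 4/5 of x_train instead of 2/3, so the per-fold models are closer to the refit model and the gap to the hold-out score tends to be smaller.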