Tags: python, machine-learning, scikit-learn, decision-tree, grid-search

GridSearchCV best hyperparameters don't produce best accuracy


Using the UCI Human Activity Recognition dataset, I am trying to build a DecisionTreeClassifier model. With default parameters and random_state set to 156, the model returns the following accuracy:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt_clf = DecisionTreeClassifier(random_state=156)
dt_clf.fit(X_train, y_train)
pred = dt_clf.predict(X_test)
print('DecisionTree Accuracy Score: {0:.4f}'.format(accuracy_score(y_test, pred)))

Output:

DecisionTree Accuracy Score: 0.8548

With an arbitrary set of max_depth, I ran GridSearchCV to find its best parameters:

from sklearn.model_selection import GridSearchCV

params = {
    'max_depth': [6, 8, 10, 12, 16, 20, 24]
}
grid_cv = GridSearchCV(dt_clf, param_grid=params, scoring='accuracy', cv=5, verbose=1)
grid_cv.fit(X_train, y_train)
print('GridSearchCV Best Score: {0:.4f}'.format(grid_cv.best_score_))
print('GridSearchCV Best Params:', grid_cv.best_params_)

Output:

Fitting 5 folds for each of 7 candidates, totalling 35 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  35 out of  35 | elapsed:  1.6min finished
GridSearchCV Best Score: 0.8513
GridSearchCV Best Params: {'max_depth': 16}

Now, I wanted to test out the "best parameter" max_depth=16 on a separate test set to see if it truly was the best parameter among the provided list max_depth = [6, 8, 10, 12, 16, 20, 24].

max_depths = [6, 8, 10, 12, 16, 20, 24]
for depth in max_depths:
    dt_clf = DecisionTreeClassifier(max_depth=depth, random_state=156)
    dt_clf.fit(X_train, y_train)
    pred = dt_clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    print('max_depth = {0} Accuracy: {1:.4f}'.format(depth, accuracy))

But to my surprise, the test showed that the "best parameter" max_depth=16 was nowhere close to being the best of the bunch:

Output:

max_depth = 6 Accuracy: 0.8558
max_depth = 8 Accuracy: 0.8707
max_depth = 10 Accuracy: 0.8673
max_depth = 12 Accuracy: 0.8646
max_depth = 16 Accuracy: 0.8575
max_depth = 20 Accuracy: 0.8548
max_depth = 24 Accuracy: 0.8548

I understand that the best parameters from GridSearchCV are based on the mean test scores obtained by cross-validating the training set (X_train, y_train), but shouldn't they still be reflected in the test-set results to some extent? I presume the UCI dataset is not imbalanced, so dataset bias shouldn't be an issue.


Solution

  • Your implicit assumption that the best hyperparameters found during CV should definitely produce the best results on an unseen test set is wrong. There is absolutely no guarantee whatsoever that something like that will happen.

    The logic behind selecting hyperparameters this way is that it is the best we can do given the (limited) information we have at hand at the time of model fitting, i.e. it is the most rational choice. But the general context of the problem here is that of decision-making under uncertainty (the decision being indeed the choice of hyperparameters), and in such a context, there are no performance guarantees of any kind on unseen data.

    Keep in mind that, by definition (and according to the underlying statistical theory), the CV results are not only dependent on the specific dataset used, but also on the specific partitioning into training & validation folds; in other words, there is always the possibility that, using a different CV partitioning of the same data, you will end up with different "best values" for the hyperparameters involved - perhaps even more so when using an unstable classifier, such as a decision tree (the sketches at the end of this answer illustrate both the thin margins between candidates and this sensitivity to the partitioning).

    All this does not of course mean either that such a use of CV is useless or that we should spend the rest of our lives trying different CV partitions of our data, in order to be sure that we have the "best" hyperparameters; it simply means that CV is indeed a useful and rational heuristic approach here, but expecting any kind of mathematical assurance that its results will be optimal on unseen data is unfounded.
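
To see how thin the margins behind the "best" choice actually are, the per-candidate mean and fold-to-fold spread of the CV accuracy can be pulled out of grid_cv.cv_results_; a minimal sketch, assuming the grid_cv object from the question has already been fitted:

import pandas as pd

# mean CV accuracy and its fold-to-fold standard deviation for each max_depth
cv_results = pd.DataFrame(grid_cv.cv_results_)
print(cv_results[['param_max_depth', 'mean_test_score', 'std_test_score']])

If the differences between the mean scores are comparable to (or smaller than) the fold-to-fold standard deviations, the "best" candidate is effectively being chosen within noise - which is exactly the uncertainty described above.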
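
And a small experiment for the partitioning point: re-running the same grid search with differently shuffled folds can move the "best" max_depth around. This is only a sketch, assuming the X_train/y_train from the question are in scope; the StratifiedKFold seeds are arbitrary:

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

params = {'max_depth': [6, 8, 10, 12, 16, 20, 24]}

# same data, same candidate grid, different 5-fold partitionings
for seed in [0, 1, 42]:
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    gs = GridSearchCV(DecisionTreeClassifier(random_state=156),
                      param_grid=params, scoring='accuracy', cv=cv)
    gs.fit(X_train, y_train)
    print('CV seed {0}: best params {1}, best CV score {2:.4f}'.format(
        seed, gs.best_params_, gs.best_score_))

If different seeds pick different depths, or pick the same depth with near-identical scores for its neighbours, that is the partition-dependence described above in action.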