Tags: machine-learning, scikit-learn, cross-validation, grid-search

Using a Validation Set in GridSearchCV/RandomizedSearchCV or not?


As far as I know, cross validation (in GridSearchCV/RandomizedSearchCV) splits the data into folds, and each fold acts as a validation set once. But scikit-learn makes the following recommendation:

Model selection by evaluating various parameter settings can be seen as a way to use the labeled data to “train” the parameters of the grid. When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid search process: it is recommended to split the data into a development set (to be fed to the GridSearchCV instance) and an evaluation set to compute performance metrics. This can be done by using the train_test_split utility function.

So we may use train_test_split to split the original data into a train set and a validation set:

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25)

Then we fit GridSearchCV/RandomizedSearchCV on X_train, y_train, and pass X_val, y_val as the eval_set in fit_params, as sketched below.
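
A minimal sketch of that setup, assuming an estimator such as xgboost.XGBClassifier whose fit method accepts an eval_set keyword (the estimator and parameter grid are assumptions for illustration):

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier  # assumed estimator; any model whose fit accepts eval_set works

param_grid = {"max_depth": [3, 5], "n_estimators": [100, 200]}  # hypothetical grid
search = GridSearchCV(XGBClassifier(), param_grid, cv=5)

# keyword arguments given to fit are forwarded to the estimator's fit,
# so the held-out X_val, y_val reach XGBClassifier.fit as eval_set
search.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)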

Is this really useful?

We end up splitting the original data twice (once with train_test_split and again inside SearchCV) --> is that necessary?

SearchCV then sees less data (X_train instead of X) --> does that mean less accuracy in training?


Solution

  • The documentation here refers to the evaluation set as the test set. Therefore you should use train_test_split to split your data into a train set and a test set.

    Performing this train_test_split is useful because you can then validate the result of your model on the test set, which contains data the model has never seen.

    The train set will be used during the GridSearchCV to find the best parameters for your model. As explained in the documentation, you can use the cv parameter to train your model on n-1 folds and validate it on the remaining fold.

    I would recommend using cross validation during GridSearchCV instead of a fixed validation set, as this will give you a better indication of how your model performs on unseen data. A sketch of this workflow is shown below.
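
    A rough sketch of the recommended workflow (the estimator and parameter grid here are hypothetical; any scikit-learn estimator works the same way):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    # held-out test set that the grid search never sees
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    # hypothetical parameter grid
    param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}

    # 5-fold cross validation on the train set: each candidate is trained on 4 folds
    # and validated on the remaining fold
    search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
    search.fit(X_train, y_train)

    # evaluate the refitted best model on the held-out test set
    print(search.best_params_)
    print(search.score(X_test, y_test))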