Tags: python, scikit-learn, cross-validation, grid-search, gridsearchcv

Preprocessing with GridSearchCV


I'm using GridSearchCV to tune hyperparameters, and I want to standardize the features (StandardScaler()) during the training and validation steps. But I don't think I can do this.

My questions are:

  1. If I apply the preprocessing step to the whole training set and then pass it to GridSearchCV for 10-fold CV, this leads to data leakage, right? In each of the 10 iterations, 9 folds are used for training and 1 fold for validation, so the scaler should be fit only on the training folds, not on the validation fold.
  2. sklearn's Pipeline won't solve this problem, will it? I assume it runs only once, which would lead to data leakage again.
  3. Is there another way to do this while still using GridSearchCV to tune the parameters?

Solution

  • Indeed, scaling the whole training set before cross-validation causes a data leak. It's very good that you caught it!
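To make the leak concrete, here is a small sketch on synthetic data (the array `X`, its size, and the random seed are assumptions made for illustration, not part of the question). It compares a scaler fit on all rows with scalers fit only on the 9 training folds of each split:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (not from the question).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Leaky: the scaler is fit on all rows, including those later used for validation.
leaky_mean = StandardScaler().fit(X).mean_

# Leak-free: in each of the 10 splits, fit the scaler on the 9 training folds only.
fold_means = []
for train_idx, val_idx in KFold(n_splits=10).split(X):
    scaler = StandardScaler().fit(X[train_idx])
    X_val_scaled = scaler.transform(X[val_idx])  # validation rows are only transformed
    fold_means.append(scaler.mean_)

fold_means = np.array(fold_means)
# Largest difference between a per-fold training mean and the whole-set mean.
max_gap = np.abs(fold_means - leaky_mean).max()
print(max_gap)
```

The gap is nonzero because the whole-set statistics also include the validation rows, which is exactly the information that should stay hidden during training.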

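    A Pipeline does solve this. Make a pipeline with StandardScaler as the first step and your classifier of choice as the second, then pass the pipeline to GridSearchCV. GridSearchCV clones and refits the whole pipeline on the training folds of every split, so the scaler is fit only on training data and the validation fold is merely transformed.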

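    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    # MyClassifier and param_grid are placeholders; param_grid is required.
    clf = make_pipeline(StandardScaler(), MyClassifier())
    grid_search = GridSearchCV(clf, param_grid, refit=True)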
    

    For more info, check this article here
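As a fuller sketch of the pipeline approach, the example below is runnable end to end. The synthetic dataset, the choice of `SVC` as the classifier, and the grid values are all assumptions made for illustration; swap in your own data and estimator.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data stands in for the real training set (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Scaler first, classifier second; SVC is an arbitrary example estimator.
pipe = make_pipeline(StandardScaler(), SVC())

# make_pipeline names each step after its lowercased class name,
# so SVC parameters are addressed as "svc__<param>".
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

# 10-fold CV: for every split, the whole pipeline (scaler included)
# is refit on the training folds only.
grid_search = GridSearchCV(pipe, param_grid, cv=10, refit=True)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(grid_search.best_score_)
```

With `refit=True`, the best pipeline is refit on the full training set at the end, so `grid_search.predict(...)` scales new data with statistics learned from all training rows.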