scikit-learn, memory-leaks, pipeline, gridsearchcv, data-preprocessing

GridSearchCV, Data Leaks & Production Process Clarity


I've read a bit about integrating scaling with cross-validation and hyperparameter tuning without risking data leaks. The most sensible solution I've found (to my knowledge) is to create a pipeline that includes the scaler and the model, and to pass that pipeline to GridSearchCV when you want to grid search and cross-validate. I've also read that, even when using cross-validation, it is useful to set aside a hold-out test set at the very beginning for an additional, final evaluation of your model after hyperparameter tuning. Putting that all together looks like this:

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# split the unscaled data to create a final hold-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# instantiate a pipeline with scaler and model, so that within each fold the
# scaler is fit on the training portion only and both the training and
# validation portions are transformed by that fitted scaler, preventing
# leakage between them
pipe = Pipeline([('sc', StandardScaler()),
                 ('knn', KNeighborsClassifier())
                 ])

# define hyperparameters to search; keys use the <step name>__<parameter> form
params = {'knn__n_neighbors': [3, 5, 7, 11]}

# create grid
search = GridSearchCV(estimator=pipe,
                      param_grid=params,
                      cv=5,
                      return_train_score=True)

search.fit(X_train, y_train)

Assuming my understanding and the above process are correct, my question is: what's next?

My guess is that we:

  1. fit our scaler to X_train
  2. transform X_train and X_test with our scaler
  3. train a new model using X_train and our newly discovered best parameters from the grid search process
  4. test the new model with our very first holdout-test set.
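
The four guessed steps above can be written out by hand, as a minimal sketch. The synthetic dataset from `make_classification` and the value `best_k = 5` are assumptions: they stand in for the real data and for whatever `search.best_params_['knn__n_neighbors']` turns out to be.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data stands in for the real X and y
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Assumption: placeholder for search.best_params_['knn__n_neighbors']
best_k = 5

# 1. fit the scaler on the training set only
scaler = StandardScaler().fit(X_train)

# 2. transform both sets with the training-set statistics
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. train a new model with the chosen hyperparameters
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_scaled, y_train)

# 4. evaluate once on the hold-out test set
test_accuracy = final_model.score(X_test_scaled, y_test)
print(f"hold-out accuracy: {test_accuracy:.3f}")
```

Keeping the scaler and the model in one Pipeline object would collapse steps 1 to 3 into a single `fit` call, which is less error-prone than tracking two fitted objects separately.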

Presumably, because GridSearchCV evaluated models with scaling fit on various slices of the data, the slightly different values produced by scaling our final, whole train and test data should be fine.

Finally, when it is time to process completely new data points through our production model, do those data points need to be transformed according to the scaler fit to our original X_train?
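
One way to make sure new data points are scaled with the statistics learned from X_train is to keep the scaler and the model together in one fitted pipeline and persist that single object. The following is only a sketch under assumptions: the data is synthetic, `model.joblib` is a hypothetical file name, `joblib` is one possible persistence choice, and a few test rows stand in for incoming production data.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data stands in for the real X and y
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

pipe = Pipeline([('sc', StandardScaler()),
                 ('knn', KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X_train, y_train)

# The fitted scaler inside the pipeline holds the X_train statistics
train_mean = pipe.named_steps['sc'].mean_

# Persist the whole pipeline (hypothetical file name), then reload it
model_path = os.path.join(tempfile.mkdtemp(), 'model.joblib')
joblib.dump(pipe, model_path)
production_pipe = joblib.load(model_path)

# New, unscaled points: predict() applies the stored scaler and does not refit it
X_new = X_test[:5]
new_predictions = production_pipe.predict(X_new)
print(new_predictions)
```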

Thank you for any help. I hope I am not completely misunderstanding fundamental aspects of this process.

Bonus Question: I've seen example code like the above from a number of sources. How does Pipeline know to fit the scaler to each fold's training data, then transform the training and test data? Usually we have to define that process ourselves:

from sklearn.preprocessing import MinMaxScaler

# define the scaler
scaler = MinMaxScaler()

# fit on the training dataset
scaler.fit(X_train)

# scale the training dataset
X_train = scaler.transform(X_train)

# scale the test dataset
X_test = scaler.transform(X_test)
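
On the bonus question, the mechanism can be illustrated with a small comparison. Calling `fit` on a Pipeline runs `fit_transform` on every intermediate step and `fit` on the last one, while `predict` and `score` only call `transform` on the intermediate steps. During cross-validation a fresh copy of the pipeline is fit on each fold's training portion. The sketch below assumes synthetic data and uses `cross_val_score` with an explicit `StratifiedKFold` to stand in for what the grid search does per candidate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data stands in for the real training set
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

pipe = Pipeline([('sc', StandardScaler()),
                 ('knn', KNeighborsClassifier())])
cv = StratifiedKFold(n_splits=5)

# Pipeline version: a fresh copy is fit on each fold's training portion
pipeline_scores = cross_val_score(pipe, X, y, cv=cv)

# The same procedure written out by hand
manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    # fit the scaler on the training portion only
    scaler = StandardScaler().fit(X[train_idx])
    knn = KNeighborsClassifier()
    knn.fit(scaler.transform(X[train_idx]), y[train_idx])
    # the validation portion is only transformed, never used for fitting
    manual_scores.append(knn.score(scaler.transform(X[val_idx]), y[val_idx]))

print(np.allclose(pipeline_scores, manual_scores))
```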

Solution

  • GridSearchCV helps you find the best set of hyperparameters for your pipeline and dataset. To do that it uses cross-validation (in your case, splitting the training set into 5 roughly equal subsets). This means that, during the search, each candidate is trained on only 80% of the training set and validated on the remaining 20%.

    A model generally benefits from seeing more data. Therefore, once you have the optimal hyperparameters, it is wise to retrain the best estimator on the whole training set and assess its performance on the test set.

    You can retrain the best estimator on the whole training set by specifying refit=True in GridSearchCV (this is also the default), and then score the resulting best_estimator_ as follows:

    search = GridSearchCV(estimator=pipe, 
                          param_grid=params, 
                          cv=5,
                          return_train_score=True,
                          refit=True)
        
    search.fit(X_train, y_train)
    tuned_pipe = search.best_estimator_
    tuned_pipe.score(X_test, y_test)
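
For a version that runs end to end, here is a minimal sketch of the whole workflow. The synthetic dataset is an assumption standing in for the real X and y. The check on the scaler's `mean_` is only there to illustrate that, with `refit=True`, the returned pipeline was fit on every training row rather than on a single 80% fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data stands in for the real X and y
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

pipe = Pipeline([('sc', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# Keys follow the <step name>__<parameter> convention (double underscore)
params = {'knn__n_neighbors': [3, 5, 7, 11]}

search = GridSearchCV(estimator=pipe, param_grid=params, cv=5,
                      return_train_score=True, refit=True)
search.fit(X_train, y_train)

tuned_pipe = search.best_estimator_

# The refit scaler saw the whole training set, not just one fold's 80%
refit_on_all_rows = np.allclose(tuned_pipe.named_steps['sc'].mean_,
                                X_train.mean(axis=0))

print(search.best_params_)                # chosen n_neighbors
print(search.best_score_)                 # mean cross-validated accuracy
print(tuned_pipe.score(X_test, y_test))   # hold-out accuracy
print(refit_on_all_rows)
```

Because the scaler lives inside `tuned_pipe`, the hold-out set is passed in unscaled and no separate scaling step is needed before scoring or predicting.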