Tags: python, normalization, cross-validation, k-fold, standardization

What is the correct way to use standardization/normalization in combination with K-Fold Cross Validation?


I have always learned that standardization or normalization should be fit only on the training set, and then be used to transform the test set. So what I'd do is:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on the training set, then transform it
X_test = scaler.transform(X_test)        # apply the same scaling to the test set

Now if I were to use this model on new data, I could just save 'scaler' and load it in any new script.
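For example, one way to persist it (joblib here is just one option, and the filename is arbitrary):

import joblib

joblib.dump(scaler, "scaler.joblib")           # save the fitted scaler

# later, in another script:
scaler = joblib.load("scaler.joblib")          # reload the fitted scaler
X_new_scaled = scaler.transform(X_new)         # X_new is a placeholder for incoming data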

I'm having trouble, though, understanding how this works with K-fold CV. Is it best practice to re-fit and transform the scaler on every fold? I can see how this works while building the model, but what if I want to use the model later on? Which scaler should I save?

Further, I want to extend this to time-series data. I understand how k-fold CV works for time series, but again, how do I combine it with scaling? In that case I would suggest saving the very last scaler, since it would be fit on 4/5 of the data (for k=5) and on the most recent data. Would that be the correct approach?


Solution

  • Is it best practice to re-fit and transform the scaler on every fold?

    Yes. You might want to read scikit-learn's doc on cross-validation:

    Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction.
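    In scikit-learn, this per-fold refitting happens automatically if the scaler and the model are combined in a Pipeline and the pipeline is passed to the cross-validation routine. A minimal sketch, assuming a placeholder Ridge regressor and data arrays X and y:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    pipe = make_pipeline(StandardScaler(), Ridge())

    # For each of the 5 folds, the scaler (and the model) are fit on the
    # training portion only and then applied to the held-out portion.
    scores = cross_val_score(pipe, X, y, cv=5)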

  • Which scaler should I save?

    Save the scaler (and any other preprocessing, e.g. bundled together as a pipeline) together with the predictor trained on all of your training data, not just the (k-1)/k of it used in cross-validation or the 70% from a single split.

    • If you're doing a regression model, it's that simple.

    • If your model training requires hyperparameter search using cross-validation (e.g., grid search for xgboost learning parameters), then you have already gathered information from across folds, so you need another test set to estimate true out-of-sample model performance. (Once you have made this estimation, you can retrain yet again on combined train+test data. This final step is not always done for neural networks that are parameterized for a particular sample size.)
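    Putting this together, here is a rough sketch of that workflow, assuming scikit-learn with a placeholder Ridge model; the parameter grid, file name, and variable names are illustrative only:

    import joblib
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import Ridge

    # Hold out a test set so performance can still be estimated after the
    # cross-validated search has used information from all training folds.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    pipe = make_pipeline(StandardScaler(), Ridge())
    search = GridSearchCV(pipe, {"ridge__alpha": [0.1, 1.0, 10.0]}, cv=5)
    search.fit(X_train, y_train)               # scaler is re-fit inside every fold

    print(search.score(X_test, y_test))        # out-of-sample performance estimate

    # best_estimator_ is the whole pipeline (scaler + predictor), refit on all
    # of X_train; saving it keeps preprocessing and model together.
    joblib.dump(search.best_estimator_, "model.joblib")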