Tags: python, machine-learning, scikit-learn, cross-validation

How to use Cross-Validation after transforming features


I have a dataset with categorical and non-categorical values. I applied OneHotEncoder to the categorical values and StandardScaler to the continuous values.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

transformerVectoriser = ColumnTransformer(transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), ['A', 'B', 'C']),
                                                        ('StandardScaler', StandardScaler(), ['D', 'E'])],
                                          remainder='passthrough')  # Default is to drop untransformed columns

Now I want to cross-validate my model, but the question is: should I transform my features, and how can I do that? I need to transform my data because that's the only way to handle categorical values. I know that I should fit_transform my training data and only transform my test data, but how can I manage that in cross-validation?

For now, I did this:

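from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

features = transformerVectoriser.fit_transform(features)

clf = RandomForestClassifier()
cv_score = cross_val_score(clf, features, results, cv=5)
print(cv_score)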

But I think this is not correct, because fit_transform is applied to both the training folds and the test fold, whereas it should be fit_transform on the training set and only transform on the test set. Should I just fit the data, just transform it, or do something else?


Solution

  • desertnaut already teased the answer in his comment. I shall just explicate and complete:

    When you want to cross-validate several data processing steps together with an estimator, the best way is to use Pipeline objects. According to the user guide, a Pipeline serves multiple purposes, one of them being safety:

    Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

    With the definitions from your question, you would wrap your transformations and classifier in a Pipeline like this:

    from sklearn.pipeline import Pipeline
    
    
    pipeline = Pipeline([
        ('transformer', transformerVectoriser),
        ('classifier', clf)
    ])
    

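    The steps in the pipeline can now be cross-validated together. Note that features must be the original, untransformed data here, not the output of the earlier fit_transform call: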

    cv_score = cross_val_score(pipeline, features, results, cv=5)
    print(cv_score)
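
For reference, here is a minimal end-to-end sketch that runs on its own. The column names A to E come from the question; the data itself, the target, and the random_state values are made up purely for illustration and are not part of the original post.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: column names follow the question, values are random.
rng = np.random.default_rng(0)
n_samples = 200
features = pd.DataFrame({
    "A": rng.choice(["a1", "a2", "a3"], size=n_samples),
    "B": rng.choice(["b1", "b2"], size=n_samples),
    "C": rng.choice(["c1", "c2", "c3", "c4"], size=n_samples),
    "D": rng.normal(loc=10.0, scale=3.0, size=n_samples),
    "E": rng.normal(loc=-2.0, scale=0.5, size=n_samples),
})
# Hypothetical binary target, loosely tied to column D so there is some signal.
results = (features["D"] + rng.normal(scale=1.0, size=n_samples) > 10.0).astype(int)

transformerVectoriser = ColumnTransformer(
    transformers=[
        ("Vector Cat", OneHotEncoder(handle_unknown="ignore"), ["A", "B", "C"]),
        ("StandardScaler", StandardScaler(), ["D", "E"]),
    ],
    remainder="passthrough",
)

pipeline = Pipeline([
    ("transformer", transformerVectoriser),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Pass the raw DataFrame: the pipeline refits the transformer inside every fold.
cv_score = cross_val_score(pipeline, features, results, cv=5)
print(cv_score)
```

The only difference from the code in the question is that the raw DataFrame goes into cross_val_score and the ColumnTransformer lives inside the pipeline instead of being applied up front.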
    

    This ensures that, in each iteration, all transformers and the final estimator in the pipeline are fit on the training folds only. On the test fold, only the transform and predict methods are called, so no statistics from the test data leak into the fitted model.
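
To make the per-fold behaviour concrete, the loop below is a rough sketch of what cross_val_score does with a pipeline. It is an illustration, not scikit-learn's actual implementation. The toy data, the step names "cat" and "num", and the use of StratifiedKFold (which, as far as I know, is what cv=5 resolves to for classifiers) are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, reduced to one categorical and one numeric column.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "A": rng.choice(["a1", "a2", "a3"], size=100),
    "D": rng.normal(size=100),
})
y = (X["D"] > 0).astype(int)

pipeline = Pipeline([
    ("transformer", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["A"]),
        ("num", StandardScaler(), ["D"]),
    ])),
    ("classifier", RandomForestClassifier(random_state=0)),
])

fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_pipeline = clone(pipeline)  # fresh, unfitted copy for every fold
    # fit(): fit_transform on the transformer, then fit on the classifier,
    # both using the training rows only.
    fold_pipeline.fit(X.iloc[train_idx], y.iloc[train_idx])
    # score(): transform (no refit) and predict on the held-out rows.
    fold_scores.append(fold_pipeline.score(X.iloc[test_idx], y.iloc[test_idx]))

print(fold_scores)
```

Because the scaler's mean and variance and the encoder's categories are learned again in every fold from the training rows alone, the held-out rows never influence the preprocessing.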

    If you want to read up more on the usage of Pipeline, check the documentation.