Tags: python, machine-learning, scikit-learn, gridsearchcv, data-preprocessing

How do I make sure GridSearchCV does the cross-validation split first and then the imputation?


I have a GridSearchCV, with a pipeline that looks something like this:

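from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Impute missing values with the most frequent value, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler())
])

# numeric_features is the list of numeric columns (defined elsewhere)
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
])

# Preprocessing and classifier combined into a single estimator
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='lbfgs'))
])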

My GridSearchCV looks like this:

search = GridSearchCV(clf, param_grid, cv=5, scoring="roc_auc", error_score=0.0)

That is, it uses 5-fold cross-validation.

So, how do I ensure that the data is split first, and only then imputed with the most frequent value?


Solution

  • GridSearchCV will run roughly like this:

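    from sklearn.base import clone
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold

    # Repeated for every parameter combination in param_grid
    for train_index, val_index in StratifiedKFold(n_splits=5).split(X, y):
        # Positional indexing as for NumPy arrays; use .iloc[...] with pandas
        X_train, X_val = X[train_index], X[val_index]
        y_train, y_val = y[train_index], y[val_index]

        # Fresh, unfitted copy of the whole pipeline for this fold
        fold_clf = clone(clf)

        # fit() fits the imputer and scaler on the training fold only
        fold_clf.fit(X_train, y_train)
        # The validation fold is only transformed, then scored with roc_auc
        roc_auc_score(y_val, fold_clf.predict_proba(X_val)[:, 1])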
    

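    Because SimpleImputer and StandardScaler are steps inside the pipeline, they are fitted (.fit()) on the training part of each fold only and then applied (.transform()) to the validation part. The split therefore always happens before the imputation, and the most frequent value is never computed from validation data.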
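
One way to check this yourself is to look at the values each fold's imputer actually learned. The sketch below is not part of the original answer. It assumes a made-up toy DataFrame (the columns x1 and x2 and the labels y are hypothetical) and uses cross_validate with return_estimator=True instead of GridSearchCV, because that returns the fitted pipeline of every fold. It then compares each imputer's statistics_ with the mode computed by hand on the training rows of that fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric columns with some missing values
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 4, size=(100, 2)).astype(float),
                 columns=["x1", "x2"])
X.iloc[::7, 0] = np.nan
X.iloc[3::11, 1] = np.nan
y = np.tile([0, 1], 50)

numeric_features = ["x1", "x2"]

# Same pipeline structure as in the question
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
])
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(solver="lbfgs")),
])

# Unshuffled StratifiedKFold is deterministic, so calling split() again
# below reproduces the folds used inside cross_validate
cv = StratifiedKFold(n_splits=5)
results = cross_validate(clf, X, y, cv=cv, scoring="roc_auc",
                         return_estimator=True)

fold_modes = []
for (train_index, _), fitted in zip(cv.split(X, y), results["estimator"]):
    # Dig the fitted imputer out of this fold's pipeline
    imputer = (fitted.named_steps["preprocessor"]
                     .named_transformers_["num"]
                     .named_steps["imputer"])
    # Mode computed by hand on the training rows of this fold only
    # (smallest value on ties, matching SimpleImputer's documented behavior)
    expected = X.iloc[train_index].mode().iloc[0].to_numpy()
    assert np.allclose(imputer.statistics_, expected)
    fold_modes.append(imputer.statistics_)
    print("imputed values learned on this training fold:", imputer.statistics_)
```

If the imputer had been fitted on the full data set before splitting, the learned values would not be tied to each fold's training rows in this way. The same reasoning applies inside GridSearchCV, which fits a fresh copy of the pipeline per fold and per parameter combination.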