
Pipeline within GridSearch repeats more than expected


I want to perform a grid search CV on my dataframe.

In my pipeline I use a custom transformer to format the data. However, when I print the shape of the data inside the custom transformer, it gets printed 11 times (i.e. the transformer is called 11 times).

I thought it should be printed 10 times: with 5-fold cross-validation, the transformer is applied to the train and the test dataframe of each fold, so 5 x 2 = 10.

But an 11th shape is displayed, and it is actually the dimensions of the full df (not split into train/test).

Do you know the reason for this 11th call?

Here is part of the code, to illustrate the problem:

def binary_data(df):
    # Encode gender as 0/1 and log the shape on every call
    df.gender = df.gender.map({'Female': 0, 'Male': 1})
    print(df.shape)
    return df

pipeline = ColumnTransformer([('binarydata', FunctionTransformer(binary_data), ['gender'])])
param_grid = {}
search = GridSearchCV(pipeline, param_grid, scoring='accuracy')
search.fit(X, y)

EDIT: the refit=True (default) flag is actually the reason for the extra call
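
A quick way to confirm this (a sketch, assuming the same pipeline, X and y as above): with refit=False, GridSearchCV skips the final fit on the full dataset, so the transformer should be called only 10 times.

# With refit=False, GridSearchCV no longer refits the best estimator
# on the full dataset, so the extra full-size print disappears.
search = GridSearchCV(pipeline, param_grid, scoring='accuracy', refit=False)
search.fit(X, y)
# Caveat: with refit=False the search has no best_estimator_ to predict with.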


Solution

  • I have built an example to check the behavior:

    import pandas as pd
    from sklearn.model_selection import GridSearchCV
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    
    # 100 rows alternating Female/Male, with labels that match the gender exactly
    values = [{'gender':'Female'} if i%2==0 else {'gender':'Male'} for i in range(100)]
    
    X = pd.DataFrame(values)
    y = [0 if i%2==0 else 1 for i in range(100)]
    
    def binary_data(df):
        # Encode gender as 0/1 and log the shape on every call
        df.gender = df.gender.map({'Female': 0, 'Male': 1})
        print(df.shape)
        return df
    
    columntransf = ColumnTransformer([('binarydata', FunctionTransformer(binary_data), ['gender'])])
    model_pipeline = Pipeline([
        ('preprocessing', columntransf),
        ('classifier', LogisticRegression(solver='lbfgs'))
    ])
    param_grid = {}
    # Defaults: 5-fold CV and refit=True
    search = GridSearchCV(model_pipeline, param_grid, scoring='accuracy')
    search.fit(X, y)
    

    And yes, as you said, I get 11 prints:

    (80, 1)
    (20, 1)
    (80, 1)
    (20, 1)
    (80, 1)
    (20, 1)
    (80, 1)
    (20, 1)
    (80, 1)
    (20, 1)
    (100, 1)
    

    But can you see the size of the last set? It is the size of the whole dataset.

    You forgot the main objective of a machine learning model: to learn from a dataset, from all the data in your dataset.

    What you are trying to do with cross-validation is to get an estimate of your model's performance while searching for the best hyperparameters with grid search.

    To make it clearer: CV is used to evaluate how well your model performs with each set of parameters, and after that your whole dataset, with the best parameters found, is used for the final learning.
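
    As a quick illustration (reusing the fitted search object from the example above), the cross-validated scores and the winning parameter set can be inspected like this:

    # Mean cross-validated score for each parameter candidate
    print(search.cv_results_['mean_test_score'])
    
    # The parameter combination that was refitted on the full dataset
    print(search.best_params_)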

    Another observation: how would the .predict() method be executed otherwise? We need only one model at the end, not five of them, to make a prediction.

    The model used in the end, fitted on the whole dataset, is the one that you can extract from:

    search.best_estimator_
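
    In fact, when refit=True, calling .predict() on the search object delegates to this refitted estimator, so the two calls below should give identical results (a small sketch, reusing the toy X from above):

    import numpy as np
    
    # Both predictions come from the single model refitted on all 100 rows
    assert np.array_equal(search.predict(X), search.best_estimator_.predict(X))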
    

    In the general case, that is the reason why we hold out a test set from the dataset: to assess whether our model will generalize well.
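
    A minimal sketch of that workflow, reusing the toy X and y from above (the 80/20 split is just illustrative):

    from sklearn.model_selection import train_test_split
    
    # Keep a test set that the grid search never sees
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    
    search.fit(X_train, y_train)         # CV + refit happen on the train set only
    print(search.score(X_test, y_test))  # estimate of generalization on held-out data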

    From the scikit-learn documentation:

    3.1. Cross-validation: evaluating estimator performance

    Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting.
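
    To make the quoted point concrete, here is a small hypothetical demonstration (random features and labels, so there is nothing real to learn): a flexible model scores perfectly on the data it was trained on, while cross-validation exposes chance-level performance.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    
    rng = np.random.RandomState(0)
    X_noise = rng.rand(100, 5)        # random features
    y_noise = rng.randint(0, 2, 100)  # random labels
    
    tree = DecisionTreeClassifier(random_state=0).fit(X_noise, y_noise)
    print(tree.score(X_noise, y_noise))                    # 1.0: the tree memorized the data
    print(cross_val_score(tree, X_noise, y_noise).mean())  # ~0.5: chance level on unseen folds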