
Pipeline within GridSearch repeats more than expected


I want to perform a grid search CV on my dataframe.

In my pipeline I use a custom transformer to format the data. However, when I print the shape of the data inside the custom transformer, it gets printed 11 times (i.e. the transformer is called 11 times).

I thought it should be printed 10 times: with 5-fold cross-validation, the transformer is applied to the train and the test dataframe of each fold, so 5 x 2 = 10.

But an 11th shape is displayed, and it is actually the dimensions of the full df (not split into train/test).

Do you know the reason for this 11th call?

Here is part of the code, to illustrate the problem:

def binary_data(df):
    # Encode gender as 0/1 and log the shape on every call
    df.gender = df.gender.map({'Female': 0, 'Male': 1})
    print(df.shape)
    return df

pipeline = ColumnTransformer([('binarydata', FunctionTransformer(binary_data), ['gender'])])
param_grid = {}
search = GridSearchCV(pipeline, param_grid, scoring='accuracy')
search.fit(X, y)

EDIT: the refit=True (default) flag is actually the reason for the extra call
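
A quick way to confirm this (a sketch, assuming the same pipeline, X and y as above): with refit=False, GridSearchCV skips the final fit on the full dataset, so the transformer should be called only 10 times.

# With refit=False, GridSearchCV no longer refits the best estimator
# on the full dataset, so the extra full-size print disappears.
search = GridSearchCV(pipeline, param_grid, scoring='accuracy', refit=False)
search.fit(X, y)
# Caveat: with refit=False the search has no best_estimator_ to predict with.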


Solution

  • I have built an example to check the behavior:

    import pandas as pd
    from sklearn.model_selection import GridSearchCV
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    
    # 100 rows alternating Female/Male, with labels that match the gender exactly
    values = [{'gender':'Female'} if i%2==0 else {'gender':'Male'} for i in range(100)]
    
    X = pd.DataFrame(values)
    y = [0 if i%2==0 else 1 for i in range(100)]
    
    def binary_data(df):
        # Encode gender as 0/1 and log the shape on every call
        df.gender = df.gender.map({'Female': 0, 'Male': 1})
        print(df.shape)
        return df
    
    columntransf = ColumnTransformer([('binarydata', FunctionTransformer(binary_data), ['gender'])])
    model_pipeline = Pipeline([
        ('preprocessing', columntransf),
        ('classifier', LogisticRegression(solver='lbfgs'))
    ])
    param_grid = {}
    # Defaults: 5-fold CV and refit=True
    search = GridSearchCV(model_pipeline, param_grid, scoring='accuracy')
    search.fit(X, y)
    

    And yes, as you said, I get 11 prints:

    (80, 1)
    (20, 1)
    (80, 1)
    (20, 1)
    (80, 1)
    (20, 1)
    (80, 1)
    (20, 1)
    (80, 1)
    (20, 1)
    (100, 1)
    

    But can you see the size of the last set? It is the size of the whole dataset.

    You forgot the main objective of a machine learning model: to learn from a dataset, from all the data in your dataset.

    What you are trying to do with cross-validation is to get an estimate of your model's performance while searching for the best hyperparameters with grid search.

    To make it clearer: CV is used to evaluate how well your model performs with each set of parameters, and after that your whole dataset, with the best parameters found, is used for the final learning.
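
    As a quick illustration (reusing the fitted search object from the example above), the cross-validated scores and the winning parameter set can be inspected like this:

    # Mean cross-validated score for each parameter candidate
    print(search.cv_results_['mean_test_score'])
    
    # The parameter combination that was refitted on the full dataset
    print(search.best_params_)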

    Another observation: how would the .predict() method be executed otherwise? We need only one model at the end, not five of them, to make a prediction.

    The model used in the end, fitted on the whole dataset, is the one that you can extract from:

    search.best_estimator_
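
    In fact, when refit=True, calling .predict() on the search object delegates to this refitted estimator, so the two calls below should give identical results (a small sketch, reusing the toy X from above):

    import numpy as np
    
    # Both predictions come from the single model refitted on all 100 rows
    assert np.array_equal(search.predict(X), search.best_estimator_.predict(X))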
    

    In the general case, that is the reason why we hold out a test set from the dataset: to assess whether our model will generalize well.
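
    A minimal sketch of that workflow, reusing the toy X and y from above (the 80/20 split is just illustrative):

    from sklearn.model_selection import train_test_split
    
    # Keep a test set that the grid search never sees
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    
    search.fit(X_train, y_train)         # CV + refit happen on the train set only
    print(search.score(X_test, y_test))  # estimate of generalization on held-out data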

    From the scikit-learn documentation:

    3.1. Cross-validation: evaluating estimator performance

    Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting.
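
    To make the quoted point concrete, here is a small hypothetical demonstration (random features and labels, so there is nothing real to learn): a flexible model scores perfectly on the data it was trained on, while cross-validation exposes chance-level performance.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    
    rng = np.random.RandomState(0)
    X_noise = rng.rand(100, 5)        # random features
    y_noise = rng.randint(0, 2, 100)  # random labels
    
    tree = DecisionTreeClassifier(random_state=0).fit(X_noise, y_noise)
    print(tree.score(X_noise, y_noise))                    # 1.0: the tree memorized the data
    print(cross_val_score(tree, X_noise, y_noise).mean())  # ~0.5: chance level on unseen folds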