python pandas machine-learning scikit-learn data-preprocessing

How to build a pipeline finding the best preprocessing per column in a fine-grained fashion?

In sklearn we can use the column transformer within a pipeline to apply a preprocessing choice to specific columns like this:

import pandas as pd
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler, ...
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

# this is my x_data
x_data = pd.DataFrame(..., columns=['Variable1', 'Variable2', 'Variable3'])

pipeline = Pipeline(steps=[('preprocessing1', make_column_transformer((StandardScaler(), ['Variable1']),
                                                                       remainder='passthrough')),
                           ('preprocessing2', make_column_transformer((MaxAbsScaler(), ['Variable2']),
                                                                       remainder='passthrough')),
                           ('preprocessing3', make_column_transformer((MinMaxScaler(), ['Variable3']),
                                                                       remainder='passthrough')),
                           ('clf', MLPClassifier(...))
                          ]
                   )

Then we would run GridSearchCV along the lines of the following:

params = [{'preprocessing1': [MinMaxScaler(), MaxAbsScaler(), StandardScaler()], # <<<<<<<<<<<<< How???
           'preprocessing2': [MinMaxScaler(), MaxAbsScaler(), StandardScaler()], # <<<<<<<<<<<<< How???
           'preprocessing3': [MinMaxScaler(), MaxAbsScaler(), StandardScaler()], # <<<<<<<<<<<<< How???
           'clf__hidden_layer_sizes': [(100,), (200,)],
           'clf__solver': ['adam', 'lbfgs', 'sgd'],
            ...
          }]

cv = GridSearchCV(pipeline, params, cv=10, verbose=1, n_jobs=-1, refit=True)

What I would like to do is find the best preprocessing for each predictor separately, because a single preprocessing choice applied to all predictors usually doesn't work best.

Solution

Pipeline parameters are addressed with a double underscore (__) that separates a step's name from the parameter being set on it, as in <step>__<parameter>.

You can list every parameter of your pipeline, together with its current value, by calling pipeline.get_params().
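As a minimal sketch of that inspection, the snippet below builds a one-step version of the pipeline (an assumption made only to keep the listing short) and prints the parameter keys that belong to the first preprocessing step. The variable name step_keys is illustrative.

```python
from sklearn.compose import make_column_transformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Shortened pipeline: one column transformer plus the classifier.
pipeline = Pipeline(
    steps=[
        ("preprocessing1", make_column_transformer((StandardScaler(), [0]), remainder="passthrough")),
        ("clf", MLPClassifier()),
    ]
)

# get_params() returns a dict; keep only the keys of the first step.
step_keys = sorted(k for k in pipeline.get_params() if k.startswith("preprocessing1__"))
for key in step_keys:
    print(key)
```

The printed keys should include preprocessing1__standardscaler (the scaler object itself) as well as nested entries such as preprocessing1__standardscaler__with_mean.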

In your case, make_column_transformer names each transformer after its lowercased class name, so the scaler in the first step of your pipeline is exposed as preprocessing1__standardscaler. This is the parameter to vary in GridSearchCV, and its candidate values are the scaler objects themselves.
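To see what the grid search does with such a key, here is a small sketch that swaps the scaler by hand with set_params, which is, as far as I understand it, the same mechanism GridSearchCV applies to a clone of the pipeline for each candidate. The one-step pipeline is again a shortened stand-in.

```python
from sklearn.compose import make_column_transformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

pipeline = Pipeline(
    steps=[
        ("preprocessing1", make_column_transformer((StandardScaler(), [0]), remainder="passthrough")),
        ("clf", MLPClassifier()),
    ]
)

# Replace the whole scaler object, not one of its hyperparameters.
pipeline.set_params(preprocessing1__standardscaler=MinMaxScaler())

# Inspect the (name, transformer, columns) triple of the first transformer.
name, scaler, columns = pipeline.named_steps["preprocessing1"].transformers[0]
print(name, type(scaler).__name__, columns)
```

Note that the slot keeps its original name, standardscaler, even after it holds a MinMaxScaler. That is why the keys in best_params_ still end in __standardscaler regardless of which scaler was selected.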

The example below illustrates how to perform this operation:

from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(
    n_features=3, n_informative=3, n_redundant=0, random_state=42
)

pipeline = Pipeline(
    steps=[
        ("preprocessing1", make_column_transformer((StandardScaler(), [0]), remainder="passthrough")),
        ("preprocessing2", make_column_transformer((StandardScaler(), [1]), remainder="passthrough")),
        ("preprocessing3", make_column_transformer((StandardScaler(), [2]), remainder="passthrough")),
        ("clf", MLPClassifier()),
    ]
)

param_grid = {
    "preprocessing1__standardscaler": [StandardScaler(), MinMaxScaler(), MaxAbsScaler()],
    "preprocessing2__standardscaler": [StandardScaler(), MinMaxScaler(), MaxAbsScaler()],
    "preprocessing3__standardscaler": [StandardScaler(), MinMaxScaler(), MaxAbsScaler()],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=10, verbose=1, n_jobs=-1)
grid_search.fit(X, y)
grid_search.best_params_

One run returned the following output (the selected scalers can differ between runs, since MLPClassifier is not seeded here):

{'preprocessing1__standardscaler': MinMaxScaler(),
 'preprocessing2__standardscaler': StandardScaler(),
 'preprocessing3__standardscaler': MaxAbsScaler()}
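A possible variation, sketched under a few assumptions: instead of chaining three column transformers, a single ColumnTransformer with explicitly named entries can hold one scaler per column. This avoids two pitfalls of chaining: by default each ColumnTransformer outputs a NumPy array, so string column names such as 'Variable2' are no longer available to later steps, and remainder='passthrough' moves the passed-through columns after the transformed ones. The entry names var1, var2 and var3 are hypothetical, and LogisticRegression stands in for MLPClassifier only to keep the run fast.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

# Synthetic data wrapped in a DataFrame with the column names from the question.
X_arr, y = make_classification(
    n_features=3, n_informative=3, n_redundant=0, random_state=42
)
X = pd.DataFrame(X_arr, columns=["Variable1", "Variable2", "Variable3"])

# One named entry per column; the names (var1, ...) are illustrative.
preprocessing = ColumnTransformer(
    transformers=[
        ("var1", StandardScaler(), ["Variable1"]),
        ("var2", StandardScaler(), ["Variable2"]),
        ("var3", StandardScaler(), ["Variable3"]),
    ]
)

pipeline = Pipeline(
    steps=[
        ("preprocessing", preprocessing),
        ("clf", LogisticRegression()),  # stand-in for MLPClassifier
    ]
)

# Each key is <step>__<entry name>; each value is a list of candidate scalers.
scalers = [StandardScaler(), MinMaxScaler(), MaxAbsScaler()]
param_grid = {
    "preprocessing__var1": scalers,
    "preprocessing__var2": scalers,
    "preprocessing__var3": scalers,
}

grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

With three candidates per column the grid has 27 combinations, so the number of fits grows quickly once classifier hyperparameters are added to the same grid.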