Tags: python, scikit-learn, pipeline, grid-search

How to apply StandardScaler on objective variable when using sklearn pipeline and GridSearch?


I want to train ML model on Ames dataset using sklearn pipeline as below:

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import VarianceThreshold  


# -----------------------------------------------------------------------------
# Data
# -----------------------------------------------------------------------------

# Ames 
X, y = fetch_openml(name="house_prices", as_frame=True, return_X_y=True)

# In this dataset, categorical features have an "object" (non-numerical) dtype.
numerical_features = X.select_dtypes(include='number').columns.tolist()   # 37
categorical_features = X.select_dtypes(exclude='number').columns.tolist()   # 43

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)


# -----------------------------------------------------------------------------
# Data preprocessing
# -----------------------------------------------------------------------------

numerical_preprocessor = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])


categorical_preprocessor = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # `sparse=False` in scikit-learn < 1.2
])
    

preprocessor = ColumnTransformer(transformers=[
    ('number', numerical_preprocessor, numerical_features),
    ('category', categorical_preprocessor, categorical_features)
], 
        verbose_feature_names_out=True,   
)


# -----------------------------------------------------------------------------
# Pipeline
# -----------------------------------------------------------------------------

model = Pipeline(
    [
        ("preprocess", preprocessor),
        ('selector', VarianceThreshold(0.0)),
        ("regressor", GradientBoostingRegressor(random_state=0)),
    ]
)


_ = model.fit(X_train, y_train)


print(f"R2 train: {model.score(X_train, y_train):.3f}")   #  0.974   
print(f"R2 test : {model.score(X_test, y_test):.3f}")   # 0.861


# -----------------------------------------------------------------------------
# Grid Search
# -----------------------------------------------------------------------------

param_grid = {
    "regressor__n_estimators": [200, 500],
    "regressor__max_features": ["sqrt", "log2"],
    "regressor__max_depth": [4, 5, 6, 7, 8],
}

grid_search = GridSearchCV(model, param_grid=param_grid, cv=3)


grid_search.fit(X_train, y_train)

grid_search.score(X_test, y_test)

What I am doing:

  • For numerical features : Imputing & MinMax scaling

  • For categorical features : Imputing & One-hot encoding

Next step is feature selection and fitting model.

However, I would also like to apply StandardScaler to the objective (target) variable y. I would like to know how to include this step in the sklearn pipeline so that there is no information leakage in the cross-validation loop.

I mean, I want to fit_transform on the train subset and only transform on the test subset during cross-validation (in hyper-parameter tuning).
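For reference, the manual pattern described above (statistics learned on the train subset only, then reused on the test subset) looks like this with StandardScaler; the target values here are illustrative stand-ins:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Illustrative stand-in targets; StandardScaler expects a 2-D array,
# hence the (n_samples, 1) shape.
y_train = rng.normal(loc=1000.0, scale=50.0, size=(100, 1))
y_test = rng.normal(loc=1000.0, scale=50.0, size=(30, 1))

scaler = StandardScaler()
y_train_scaled = scaler.fit_transform(y_train)    # statistics learned on train only
y_test_scaled = scaler.transform(y_test)          # train statistics reused: no leakage
y_back = scaler.inverse_transform(y_test_scaled)  # recover original units
```

Doing this by hand inside every cross-validation fold is exactly the bookkeeping the question wants the pipeline to take over.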

I am not sure if I can use TransformedTargetRegressor, since it only has a fit method, while StandardScaler needs fit_transform on the train subset.


Solution

  • Yes, you answered the question yourself. TransformedTargetRegressor transforms the target with the transformer/pipeline you provide.

    It automatically performs a fit_transform on y during training. The model is then fitted on the transformed version of y, and predictions are inverse_transformed back to the original scale before being returned.

    You are basically transforming the target space (with an arbitrary operation, linear or nonlinear) and running the regression on it.

    This transformation is done automatically by the wrapper. You can use it in your code like this:

    from sklearn.compose import TransformedTargetRegressor
    from sklearn.preprocessing import StandardScaler
    
    param_grid = {
        "regressor__regressor__n_estimators": [200, 500],
        "regressor__regressor__max_features": ["sqrt", "log2"],
        "regressor__regressor__max_depth": [4, 5, 6, 7, 8],
    }
    
    grid_search = GridSearchCV(
        TransformedTargetRegressor(
            regressor=model,
            transformer=StandardScaler(),
        ),
        param_grid=param_grid,
        cv=3,
    )
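To see the whole mechanism end to end without the slow Ames download, here is a minimal, self-contained sketch: make_regression and a Ridge regressor are stand-ins for the real dataset and pipeline, but the wrapping and parameter naming are the same. Predictions come back in the original target units because inverse_transform is applied automatically:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data; the target is shifted so that scaling actually matters.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
y = y + 1000.0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ttr = TransformedTargetRegressor(
    regressor=Ridge(),            # stand-in for the full preprocessing pipeline
    transformer=StandardScaler(),
)

# Inner-estimator hyper-parameters are reached via the "regressor__" prefix;
# with a Pipeline inside, the prefixes chain (hence "regressor__regressor__..."
# in the grid above).
grid = GridSearchCV(ttr, param_grid={"regressor__alpha": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X_train, y_train)

pred = grid.predict(X_test)  # already inverse-transformed to original units
```

Within each CV fold, the scaler is fitted on the training folds only and its inverse applied to predictions, so no target statistics leak into the validation fold.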