I want to train an ML model on the Ames housing dataset using an sklearn pipeline, as shown below:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import VarianceThreshold
# -----------------------------------------------------------------------------
# Data
# -----------------------------------------------------------------------------
# Ames
X, y = fetch_openml(name="house_prices", as_frame=True, return_X_y=True)
# In this dataset, categorical features have an "object" (non-numerical) dtype.
numerical_features = X.select_dtypes(include='number').columns.tolist() # 37
categorical_features = X.select_dtypes(exclude='number').columns.tolist() # 43
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# -----------------------------------------------------------------------------
# Data preprocessing
# -----------------------------------------------------------------------------
numerical_preprocessor = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])
categorical_preprocessor = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # use sparse=False on sklearn < 1.2
])
preprocessor = ColumnTransformer(
    transformers=[
        ('number', numerical_preprocessor, numerical_features),
        ('category', categorical_preprocessor, categorical_features)
    ],
    verbose_feature_names_out=True,
)
# -----------------------------------------------------------------------------
# Pipeline
# -----------------------------------------------------------------------------
model = Pipeline(
    [
        ("preprocess", preprocessor),
        ("selector", VarianceThreshold(0.0)),
        ("regressor", GradientBoostingRegressor(random_state=0)),
    ]
)
_ = model.fit(X_train, y_train)
print(f"R2 train: {model.score(X_train, y_train):.3f}") # 0.974
print(f"R2 test : {model.score(X_test, y_test):.3f}") # 0.861
# -----------------------------------------------------------------------------
# Grid Search
# -----------------------------------------------------------------------------
param_grid = {
    "regressor__n_estimators": [200, 500],
    "regressor__max_features": ["sqrt", "log2"],
    "regressor__max_depth": [4, 5, 6, 7, 8],
}
grid_search = GridSearchCV(model, param_grid=param_grid, cv=3)
grid_search.fit(X_train, y_train)
print(f"R2 test : {grid_search.score(X_test, y_test):.3f}")
What I am doing:
For numerical features: imputing & MinMax scaling.
For categorical features: imputing & one-hot encoding.
The next steps are feature selection and fitting the model.
However, I would also like to apply StandardScaler to the target variable (y). How can I include this step in the sklearn pipeline so that there is no leakage of information in the cross-validation loop? That is, I want to fit_transform on the train subset and only transform on the test subset during cross-validation (in hyper-parameter tuning).
I am not sure whether I can use TransformedTargetRegressor, since it only exposes a fit method, while StandardScaler needs to be fit_transform-ed on the train subset.
Yes, you answered the question yourself. TransformedTargetRegressor performs the target transformation with the Transformer/Pipeline you provide.
During the training phase it automatically calls fit_transform on y. The model is then fitted on the transformed version of y, and predictions are inverse_transform-ed before being returned as model output.
You are basically transforming the target space (with an arbitrary operation, linear or nonlinear) and running the regression on it.
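For intuition, this is roughly the manual equivalent of what the wrapper does on a single train/test split (only a sketch of the mechanism, reusing the model pipeline and the split defined above):
from sklearn.preprocessing import StandardScaler
# fit the scaler on the training target only, never on the test target
target_scaler = StandardScaler()
y_train_scaled = target_scaler.fit_transform(y_train.to_numpy().reshape(-1, 1)).ravel()
model.fit(X_train, y_train_scaled)  # the regressor is trained on the scaled target
y_pred = target_scaler.inverse_transform(  # predictions are mapped back to the original scale
    model.predict(X_test).reshape(-1, 1)
).ravel()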
This transformation is done automatically by the wrapper, and you can use it in your code like this:
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler

# Note the double "regressor__" prefix: the pipeline is nested inside
# TransformedTargetRegressor's regressor parameter.
param_grid = {
    "regressor__regressor__n_estimators": [200, 500],
    "regressor__regressor__max_features": ["sqrt", "log2"],
    "regressor__regressor__max_depth": [4, 5, 6, 7, 8],
}
grid_search = GridSearchCV(
    TransformedTargetRegressor(
        regressor=model,
        transformer=StandardScaler(),
    ),
    param_grid=param_grid,
    cv=3,
)
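Fitting and scoring then work just as before. Inside GridSearchCV, every CV split re-fits the whole TransformedTargetRegressor, including the StandardScaler on y, on that fold's training part only, so there is no target leakage. A usage sketch:
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
# score() predicts through the wrapper, so R2 is computed on the original target scale
print(f"R2 test : {grid_search.score(X_test, y_test):.3f}")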