python, machine-learning, scikit-learn, pipeline, normalization

How to use StandardScaler inside a pipeline only on certain values?


I have a problem. I want to use StandardScaler(), but my dataset contains one-hot-encoded columns and other values that should not be scaled. If I run StandardScaler() on the whole dataset, every value gets scaled. So is there a way to apply this method only to certain columns inside a pipeline?

I found this question: One-Hot-Encode categorical variables and scale continuous ones simultaneously, which uses the code below:

columns = ['rank']
columns_to_scale  = ['gre', 'gpa']

scaler = StandardScaler()
ohe    = OneHotEncoder(sparse=False)  # 'sparse_output=False' in newer scikit-learn versions

# Scale the continuous columns and encode the categorical one
# ('data' is the DataFrame used in that question)
scaled_columns  = scaler.fit_transform(data[columns_to_scale])
encoded_columns = ohe.fit_transform(data[columns])

# Concatenate (Column-Bind) Processed Columns Back Together
processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)

So is there a way to run StandardScaler() inside a pipeline on only certain columns, and have the remaining columns merged back together with the scaled ones? The pipeline should apply StandardScaler only to the columns 'xy' and 'xyz'.

StandardScaler Class

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class StandardScaler_with_certain_features(BaseEstimator, TransformerMixin):
    def __init__(self, columns_to_scale):
        self.columns_to_scale = columns_to_scale
        self.scaler = StandardScaler()

    def fit(self, X, y=None):
        # fit the scaler on the selected columns of the training data only
        self.scaler.fit(X[self.columns_to_scale])
        return self

    def transform(self, X, y=None):
        # scale only the selected columns; leave the others untouched
        X = X.copy()
        X[self.columns_to_scale] = self.scaler.transform(X[self.columns_to_scale])
        return X
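
For reference, a quick sanity check of the transformer above on a toy DataFrame (the column names and data are made up for illustration):

import numpy as np
import pandas as pd

# toy data: 'xy' and 'xyz' should be scaled, 'flag' should pass through unchanged
df = pd.DataFrame({
    'xy': np.random.normal(2, 3, 10),
    'xyz': np.random.normal(4, 5, 10),
    'flag': np.random.randint(0, 2, 10),  # e.g. a one-hot / binary column
})

custom_scaler = StandardScaler_with_certain_features(columns_to_scale=['xy', 'xyz'])
out = custom_scaler.fit_transform(df)
print(out[['xy', 'xyz']].mean().round(2))  # roughly 0 after scaling
print(out['flag'].equals(df['flag']))      # True: column left untouched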

Pipeline

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics

columns_to_scale = ['xy', 'xyz']

steps = [('standard_scaler', StandardScaler_with_certain_features(columns_to_scale)),
         ('feature_selection', SelectFromModel(estimator=LogisticRegression(max_iter=100))),
         ('lasso', Lasso(alpha=0.03))]

pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

parameters = { }

grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)
grid.fit(X_train, y_train)

print("score = %3.2f" % (grid.score(X_test, y_test)))
print('Training set score: ' + str(grid.score(X_train, y_train)))
print('Test set score: ' + str(grid.score(X_test, y_test)))

# Prediction
y_pred = grid.predict(X_test)
print("RMSE Val:", metrics.mean_squared_error(y_test, y_pred, squared=False))

Solution

  • You can include a ColumnTransformer in the Pipeline to apply the StandardScaler only to certain columns. Set remainder='passthrough' so that the columns that are not scaled are passed through unchanged and concatenated with the scaled ones.

    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.compose import ColumnTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import Lasso
    
    df = pd.DataFrame({
        'y': np.random.normal(0, 1, 100),
        'x': np.random.normal(0, 1, 100),
        'z': np.random.normal(0, 1, 100),
        'xy': np.random.normal(2, 3, 100),
        'xyz': np.random.normal(4, 5, 100),
    })
    
    X = df.drop(labels=['y'], axis=1)
    y = df['y']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)
    
    preprocessor = ColumnTransformer(
        transformers=[('scaler', StandardScaler(), ['xy', 'xyz'])],
        remainder='passthrough'
    )
    
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('lasso', Lasso(alpha=0.03))
    ])
    
    pipeline.fit(X_train, y_train)
    pipeline.score(X_test, y_test)
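
  • The same preprocessor also slots into the pipeline from the question. Below is a minimal sketch that reuses the toy data above; note that the SelectFromModel estimator is swapped from LogisticRegression to LinearRegression, since the target here is continuous and LogisticRegression expects class labels, and the parameter grid is only an illustrative example.

    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import mean_squared_error

    steps = [
        ('preprocessor', preprocessor),                                         # scale only 'xy' and 'xyz'
        ('feature_selection', SelectFromModel(estimator=LinearRegression())),   # regressor instead of LogisticRegression
        ('lasso', Lasso(alpha=0.03)),
    ]
    pipeline = Pipeline(steps)

    parameters = {'lasso__alpha': [0.01, 0.03, 0.1]}  # example grid; extend as needed
    grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)
    grid.fit(X_train, y_train)

    print('Test set score:', grid.score(X_test, y_test))
    y_pred = grid.predict(X_test)
    print('RMSE:', mean_squared_error(y_test, y_pred) ** 0.5)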