python, machine-learning, scikit-learn, pipeline, normalization

How to use StandardScaler inside a pipeline only on certain values?


I have a problem. I want to use StandardScaler(), but my dataset contains one-hot-encoded columns and other values that should not be scaled. If I run StandardScaler() on the whole dataset, every value gets scaled. So is there a way to apply this method only to certain columns inside a pipeline?

I found this question: One-Hot-Encode categorical variables and scale continuous ones simultaneously, which uses the code below:

columns = ['rank']
columns_to_scale  = ['gre', 'gpa']

scaler = StandardScaler()
ohe    = OneHotEncoder(sparse=False)  # 'sparse_output=False' in newer scikit-learn versions

# Scale the continuous columns and encode the categorical one
# ('data' is the DataFrame used in that question)
scaled_columns  = scaler.fit_transform(data[columns_to_scale])
encoded_columns = ohe.fit_transform(data[columns])

# Concatenate (Column-Bind) Processed Columns Back Together
processed_data = np.concatenate([scaled_columns, encoded_columns], axis=1)

So is there a way to run StandardScaler() inside a pipeline on only certain columns, and have the remaining columns merged back together with the scaled ones? The pipeline should apply StandardScaler only to the columns 'xy' and 'xyz'.

StandardScaler Class

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class StandardScaler_with_certain_features(BaseEstimator, TransformerMixin):
    def __init__(self, columns_to_scale):
        self.columns_to_scale = columns_to_scale
        self.scaler = StandardScaler()

    def fit(self, X, y=None):
        # fit the scaler on the selected columns of the training data only
        self.scaler.fit(X[self.columns_to_scale])
        return self

    def transform(self, X, y=None):
        # scale only the selected columns; leave the others untouched
        X = X.copy()
        X[self.columns_to_scale] = self.scaler.transform(X[self.columns_to_scale])
        return X
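
For reference, a quick sanity check of the transformer above on a toy DataFrame (the column names and data are made up for illustration):

import numpy as np
import pandas as pd

# toy data: 'xy' and 'xyz' should be scaled, 'flag' should pass through unchanged
df = pd.DataFrame({
    'xy': np.random.normal(2, 3, 10),
    'xyz': np.random.normal(4, 5, 10),
    'flag': np.random.randint(0, 2, 10),  # e.g. a one-hot / binary column
})

custom_scaler = StandardScaler_with_certain_features(columns_to_scale=['xy', 'xyz'])
out = custom_scaler.fit_transform(df)
print(out[['xy', 'xyz']].mean().round(2))  # roughly 0 after scaling
print(out['flag'].equals(df['flag']))      # True: column left untouched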

Pipeline

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics

columns_to_scale = ['xy', 'xyz']

steps = [('standard_scaler', StandardScaler_with_certain_features(columns_to_scale)),
         ('feature_selection', SelectFromModel(estimator=LogisticRegression(max_iter=100))),
         ('lasso', Lasso(alpha=0.03))]

pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

parameters = { }

grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)
grid.fit(X_train, y_train)

print("score = %3.2f" % (grid.score(X_test, y_test)))
print('Training set score: ' + str(grid.score(X_train, y_train)))
print('Test set score: ' + str(grid.score(X_test, y_test)))

# Prediction
y_pred = grid.predict(X_test)
print("RMSE Val:", metrics.mean_squared_error(y_test, y_pred, squared=False))

Solution

  • You can include a ColumnTransformer in the Pipeline to apply the StandardScaler only to certain columns. Set remainder='passthrough' so that the columns that are not scaled are passed through unchanged and concatenated with the scaled ones.

    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.compose import ColumnTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import Lasso
    
    df = pd.DataFrame({
        'y': np.random.normal(0, 1, 100),
        'x': np.random.normal(0, 1, 100),
        'z': np.random.normal(0, 1, 100),
        'xy': np.random.normal(2, 3, 100),
        'xyz': np.random.normal(4, 5, 100),
    })
    
    X = df.drop(labels=['y'], axis=1)
    y = df['y']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)
    
    preprocessor = ColumnTransformer(
        transformers=[('scaler', StandardScaler(), ['xy', 'xyz'])],
        remainder='passthrough'
    )
    
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('lasso', Lasso(alpha=0.03))
    ])
    
    pipeline.fit(X_train, y_train)
    pipeline.score(X_test, y_test)
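
  • The same preprocessor also slots into the pipeline from the question. Below is a minimal sketch that reuses the toy data above; note that the SelectFromModel estimator is swapped from LogisticRegression to LinearRegression, since the target here is continuous and LogisticRegression expects class labels, and the parameter grid is only an illustrative example.

    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import mean_squared_error

    steps = [
        ('preprocessor', preprocessor),                                         # scale only 'xy' and 'xyz'
        ('feature_selection', SelectFromModel(estimator=LinearRegression())),   # regressor instead of LogisticRegression
        ('lasso', Lasso(alpha=0.03)),
    ]
    pipeline = Pipeline(steps)

    parameters = {'lasso__alpha': [0.01, 0.03, 0.1]}  # example grid; extend as needed
    grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)
    grid.fit(X_train, y_train)

    print('Test set score:', grid.score(X_test, y_test))
    y_pred = grid.predict(X_test)
    print('RMSE:', mean_squared_error(y_test, y_pred) ** 0.5)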