Use sklearn transformers on list of columns and preserve the input columns

Using sklearn transformers, I'd like to apply transformations to a list of columns and have the transformer create new columns with the transformed values rather than overwrite the existing ones. Is this possible? The transformer also needs to slot into a Pipeline.

My goal is to compare the original columns and transformed columns. A wrapper class around the transformer could work, but I wonder if there's an easier way? Thank you.

Solution

The easiest way to do this is to write a function that takes a list of the features you would like to transform. From there you have two options:

1. Replace the values in place. Since the function transforms only the features you pass in, you can assign the result back to those same columns, overwriting the original values.
2. Keep the originals untouched. This is what I would recommend. Create a copy of the original dataframe and "paste" all the transformed features into it. You can then display the two dataframes in different cells (I'm assuming you're using Jupyter notebooks) to compare the differences.

This would be the function to use:

    def transform_data(scaler, df, feats_to_transform):
        # Select only the requested columns from the dataframe passed in
        features = df[feats_to_transform]
        # The transformers take only 2D arrays
        transformed_feats = scaler.fit_transform(features.values)

        return transformed_feats

Method 1:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv('path/to/csv')
    scaler = StandardScaler()

    feats_to_transform = ['feat1', 'feat2', 'feat3']  # one string per column name
    transformed_feats = transform_data(scaler, df, feats_to_transform)

    # Overwrite the original columns with the transformed values
    df[feats_to_transform] = transformed_feats

Method 2:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv('path/to/csv')
    scaled_df = df.copy(deep=True)  # a deep copy leaves the original data unaltered
    scaler = StandardScaler()

    feats_to_transform = ['feat1', 'feat2', 'feat3']  # one string per column name
    transformed_feats = transform_data(scaler, df, feats_to_transform)

    # Write the transformed values into the copy only
    scaled_df[feats_to_transform] = transformed_feats

    # now compare in different cells
    df.head()
    scaled_df.head()
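
If you need this to slot into a Pipeline, as the question asks, a small wrapper transformer is one possible route. The following is only a sketch under my own assumptions: `AppendTransformed`, its `suffix` parameter, and the toy dataframe are names I made up for illustration and are not part of sklearn. It fits the wrapped transformer on the listed columns and appends the results as new suffixed columns, so the originals stay in place for comparison.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class AppendTransformed(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: append transformed copies of `columns`
    as '<column><suffix>' and leave the input columns untouched."""

    def __init__(self, transformer, columns, suffix="_scaled"):
        self.transformer = transformer
        self.columns = columns
        self.suffix = suffix

    def fit(self, X, y=None):
        # Fit a clone so the transformer passed in is never mutated
        self.transformer_ = clone(self.transformer).fit(X[self.columns].values)
        return self

    def transform(self, X):
        out = X.copy()  # never modify the caller's dataframe
        values = self.transformer_.transform(X[self.columns].values)
        # Add one new column per transformed input column
        for i, col in enumerate(self.columns):
            out[col + self.suffix] = values[:, i]
        return out


# Toy data standing in for the real CSV (assumed column names)
df = pd.DataFrame({
    "feat1": [1.0, 2.0, 3.0],
    "feat2": [10.0, 20.0, 30.0],
    "other": ["a", "b", "c"],
})

pipe = Pipeline([
    ("append", AppendTransformed(StandardScaler(), ["feat1", "feat2"])),
])
result = pipe.fit_transform(df)
print(result)
```

Note that the output is a dataframe containing both the original and the transformed columns, so any later pipeline step has to select the columns it actually wants.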