python, scikit-learn

Use sklearn transformers on list of columns and preserve the input columns


Using sklearn transformers, I'd like to apply transformations to a list of columns and have the transformer create new columns with the transformed values rather than overwrite the existing ones. Is this possible? The transformer also needs to slot into a Pipeline.

My goal is to compare the original columns and transformed columns. A wrapper class around the transformer could work, but I wonder if there's an easier way? Thank you.


Solution

  • The easiest way to do this would be to use a function with an argument that accepts a list of the features you would like to transform. From there you have 2 options:

    1. This is the method you requested. Since the function returns transformed values only for the features you pass in, you can assign them back to those same columns, overwriting the originals in the dataframe.

    2. This is what I would recommend. Create a copy of the original dataframe and assign the transformed features into the copy. You can then display the two dataframes in separate cells (I'm assuming you're using Jupyter notebooks) to compare them.

    This would be the function to use:

        def transform_data(scaler, df, feats_to_transform):
            # Select only the requested columns from the dataframe passed in
            features = df[feats_to_transform]
            # The transformers take only 2D arrays
            transformed_feats = scaler.fit_transform(features.values)
            return transformed_feats
    

    Method 1:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv('path/to/csv')
    scaler = StandardScaler()

    feats_to_transform = ['feat1', 'feat2', 'feat3']  # one string per column name
    transformed_feats = transform_data(scaler, df, feats_to_transform)
    
    df[feats_to_transform] = transformed_feats
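
Since the question asks for new columns alongside the originals, a small variation on Method 1 may be closer to that goal: write the transformed values under new names instead of overwriting. The sketch below is only illustrative. The toy dataframe stands in for the CSV, and the `_scaled` suffix is an assumed naming choice, not something the question or answer specifies.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the CSV; the column names are placeholders.
df = pd.DataFrame({"feat1": [1.0, 2.0, 3.0], "feat2": [10.0, 20.0, 30.0]})
feats_to_transform = ["feat1", "feat2"]

# fit_transform returns a 2D array with one column per input feature.
scaled = StandardScaler().fit_transform(df[feats_to_transform].to_numpy())

# Store each transformed column under a new, suffixed name (assumed convention),
# so the original columns stay untouched.
for i, col in enumerate(feats_to_transform):
    df[f"{col}_scaled"] = scaled[:, i]
```

With both versions in one dataframe, `df[["feat1", "feat1_scaled"]]` shows an original column next to its transformed counterpart.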
    

    Method 2:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv('path/to/csv')
    scaled_df = df.copy(deep=True)  # A deep copy prevents alteration of the original data
    scaler = StandardScaler()

    feats_to_transform = ['feat1', 'feat2', 'feat3']  # one string per column name
    transformed_feats = transform_data(scaler, df, feats_to_transform)

    scaled_df[feats_to_transform] = transformed_feats
    
    # now compare in different cells
    df.head()
    scaled_df.head()
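
The question also asks for something that slots into a Pipeline, which the helper function above does not do on its own. One possible approach is the wrapper class the question mentions. The following is a minimal sketch under stated assumptions: `AppendTransformed`, its `suffix` parameter, and the toy dataframe are hypothetical names introduced here for illustration, and the input is assumed to be a pandas DataFrame.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class AppendTransformed(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: applies `transformer` to `columns` and appends the
    results as new columns named `<column><suffix>`, keeping the originals."""

    def __init__(self, transformer, columns, suffix="_scaled"):
        self.transformer = transformer
        self.columns = columns
        self.suffix = suffix

    def fit(self, X, y=None):
        # Fit a clone so the transformer instance passed in is left unfitted.
        self.transformer_ = clone(self.transformer)
        self.transformer_.fit(X[self.columns].to_numpy())
        return self

    def transform(self, X):
        values = self.transformer_.transform(X[self.columns].to_numpy())
        new_names = [f"{col}{self.suffix}" for col in self.columns]
        transformed = pd.DataFrame(values, columns=new_names, index=X.index)
        # Original columns first, transformed columns appended on the right.
        return pd.concat([X, transformed], axis=1)


# Toy data standing in for the CSV; the column names are placeholders.
df = pd.DataFrame(
    {"feat1": [1.0, 2.0, 3.0], "feat2": [10.0, 20.0, 30.0], "other": ["a", "b", "c"]}
)
pipe = Pipeline([("append", AppendTransformed(StandardScaler(), ["feat1", "feat2"]))])
result = pipe.fit_transform(df)
```

Here `result` keeps `feat1`, `feat2`, and `other` as they were and gains `feat1_scaled` and `feat2_scaled`. Note that later pipeline steps would receive the widened dataframe, including any non-numeric columns, so an estimator at the end of the pipeline may still need its input columns selected explicitly.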