
New Feature in Scikit-Learn Pipeline - Interaction between two existing Features


I have two features in my data set: height and area. I want to create a new feature from the interaction of area and height, using a Pipeline in scikit-learn.

Can anyone please guide me on how I can achieve this?

Thanks


Solution

  • You can achieve this with a custom transformer that implements a fit and a transform method. Optionally, you can make it inherit from sklearn's TransformerMixin for bullet-proofing: the mixin adds a fit_transform method built from the two.

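    from sklearn.base import BaseEstimator, TransformerMixin
    
    # BaseEstimator adds get_params/set_params, which scikit-learn
    # needs to clone the transformer (e.g. in a grid search)
    class CustomTransformer(TransformerMixin, BaseEstimator):
        def fit(self, X, y=None):
            """Nothing is learned from the data here, but fit is
               still required if your pipeline ever needs to be
               fit. It only marks the transformer as fitted."""
            # Trailing-underscore attribute: how scikit-learn's
            # check_is_fitted recognizes a fitted estimator
            self.n_features_in_ = X.shape[1]
            return self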
    
        def transform(self, X, y=None):
            """This is where the actual transformation occurs.
               Assuming you want to compute the product of your
               height and area features.
            """
            # Copy X to avoid mutating the original dataset
            X_ = X.copy()
            # Rename new_feature and change the right-hand side to suit your needs
            X_["new_feature"] = X_["height"] * X_["area"]
            # Return the transformed dataset. It will be
            # passed to the next step of your pipeline
            return X_
    

    You can test it with this code:

    import pandas as pd
    from sklearn.pipeline import Pipeline
    
    # Instantiate a toy dataset, your transformer and the pipeline
    X = pd.DataFrame({"height": [10, 23, 34], "area": [345, 33, 45]})
    custom = CustomTransformer()
    pipeline = Pipeline([("heightxarea", custom)])
    
    # Test it
    pipeline.fit(X)
    pipeline.transform(X)
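
To make the expected result concrete, here is a minimal, self-contained sketch of the same idea with the output checked by assertions. The class name HeightAreaInteraction and the column name height_x_area are hypothetical choices for illustration, not part of the original answer; the fit method records n_features_in_ only so that scikit-learn can see the transformer has been fitted.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class HeightAreaInteraction(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: adds a height * area column to a DataFrame."""

    def fit(self, X, y=None):
        # Nothing to learn; the trailing-underscore attribute only marks
        # the transformer as fitted for scikit-learn's checks.
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is not mutated.
        X_ = X.copy()
        X_["height_x_area"] = X_["height"] * X_["area"]
        return X_


# Same toy data as in the test above.
X = pd.DataFrame({"height": [10, 23, 34], "area": [345, 33, 45]})
pipeline = Pipeline([("heightxarea", HeightAreaInteraction())])
result = pipeline.fit_transform(X)

# The interaction column is the row-wise product of the two inputs.
assert result["height_x_area"].tolist() == [3450, 759, 1530]
# The input frame still has only its two original columns.
assert list(X.columns) == ["height", "area"]
```

The transformed frame keeps the original columns and appends the product, so later pipeline steps see all three features.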
    

    For such simple processing, this might seem like overkill, but it is good practice to put every dataset manipulation into a transformer. The same steps are then applied identically each time the pipeline is fit or used on new data, which makes them more reproducible.
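
As an illustration of that reproducibility point, the sketch below chains the transformer with a model, so the interaction feature is rebuilt automatically both when fitting and when predicting on new rows. Everything beyond the original answer is an assumption made for the example: the class name HeightAreaInteraction, the step names, the choice of LinearRegression, and the made-up data, whose target is constructed as the exact product so that a linear model can recover it.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class HeightAreaInteraction(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: adds a height * area column to a DataFrame."""

    def fit(self, X, y=None):
        # Nothing to learn; only mark the transformer as fitted.
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        X_ = X.copy()
        X_["height_x_area"] = X_["height"] * X_["area"]
        return X_


# Made-up training data (assumption for the example).
train = pd.DataFrame(
    {
        "height": [10, 23, 34, 15, 28, 40],
        "area": [345, 33, 45, 120, 80, 60],
    }
)
# Target built as the exact product, so the interaction term explains it fully.
y_train = train["height"] * train["area"]

model = Pipeline(
    [
        ("heightxarea", HeightAreaInteraction()),
        ("regressor", LinearRegression()),
    ]
)
model.fit(train, y_train)

# New rows only need the raw columns; the pipeline adds the interaction itself.
new_data = pd.DataFrame({"height": [20, 30], "area": [50, 100]})
predictions = model.predict(new_data)
print(np.round(predictions, 3))
```

Because the feature is created inside the pipeline, there is no separate preprocessing script to remember to run on new data.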