Python Pipeline Custom Transformer


I am trying to code a custom transformer to be used in a pipeline to pre-process data.

Here is the code I'm using (taken from another source, not written by me). It takes in a DataFrame, scales the features, and returns a DataFrame:

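import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class DFStandardScaler(BaseEstimator, TransformerMixin):
    """StandardScaler wrapper that returns a DataFrame instead of an array."""

    def __init__(self):
        self.ss = None

    def fit(self, X, y=None):
        # Learn the per-column mean and standard deviation
        self.ss = StandardScaler().fit(X)
        return self

    def transform(self, X):
        # Scale, then restore the original index and column labels
        Xss = self.ss.transform(X)
        Xscaled = pd.DataFrame(Xss, index=X.index, columns=X.columns)
        return Xscaled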

My data has both categorical and continuous features, and the scaler cannot handle the categorical feature ('sex'). When I fit this pipeline on the dataframe below, it throws an error because it tries to scale the categorical labels in 'sex':

     sex  length  diameter  height  whole_weight  shucked_weight  \
0      M   0.455     0.365   0.095        0.5140          0.2245   
1      M   0.350     0.265   0.090        0.2255          0.0995   
2      F   0.530     0.420   0.135        0.6770          0.2565   
3      M   0.440     0.365   0.125        0.5160          0.2155   
4      I   0.330     0.255   0.080        0.2050          0.0895   
5      I   0.425     0.300   0.095        0.3515          0.1410   

How do I pass a list of categorical / continuous features into the transformer so it will scale the proper features? Or is it better to somehow code the feature type check inside the transformer?


Solution

  • You need another step in the Pipeline: a similar class, also inheriting from BaseEstimator and TransformerMixin, that selects the columns to pass on to the scaler.

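    from sklearn.base import BaseEstimator, TransformerMixin

    class ColumnSelector(BaseEstimator, TransformerMixin):
        def __init__(self, columns: list):
            # Store the argument under the same name as the parameter so that
            # BaseEstimator.get_params() and sklearn.base.clone() keep working
            self.columns = columns

        def fit(self, X, y=None):
            # Nothing to learn; selection is stateless
            return self

        def transform(self, X, y=None):
            # Keep only the requested columns, preserving the index
            return X.loc[:, self.columns]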
    

    Then, in your main script, the pipeline looks like this:

    from sklearn import pipeline

    selector = ColumnSelector(['length', 'diameter', 'height', 'whole_weight', 'shucked_weight'])
    pipe = pipeline.make_pipeline(
        selector,
        DFStandardScaler()
    )

    pipe2 = pipeline.make_pipeline(
        # some steps for the sex column
    )

    full_pipeline = pipeline.make_pipeline(
        pipeline.make_union(
            pipe,
            pipe2
        ),
        # some other step
    )
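
To make the approach concrete, here is a minimal end-to-end sketch on a few rows of the data from the question. Several parts are assumptions for illustration and not part of the original answer: the `DFDummies` transformer is a hypothetical stand-in for the unspecified "steps for the sex column", the numeric columns are picked with `select_dtypes` instead of a hand-written list, and only three of the numeric columns are included to keep the example short.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Keep only the named columns of a DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.loc[:, self.columns]


class DFStandardScaler(BaseEstimator, TransformerMixin):
    """Scale every column and return a DataFrame."""

    def fit(self, X, y=None):
        self.ss_ = StandardScaler().fit(X)
        return self

    def transform(self, X):
        return pd.DataFrame(self.ss_.transform(X), index=X.index, columns=X.columns)


class DFDummies(BaseEstimator, TransformerMixin):
    """Hypothetical one-hot step: remembers the categories seen during fit."""

    def fit(self, X, y=None):
        self.categories_ = {c: sorted(X[c].unique()) for c in X.columns}
        return self

    def transform(self, X):
        X = X.copy()
        for col, cats in self.categories_.items():
            # Fixed categories give the same output columns at fit and predict time
            X[col] = pd.Categorical(X[col], categories=cats)
        return pd.get_dummies(X, dtype=float)


# A few rows and columns from the question's data
df = pd.DataFrame({
    "sex": ["M", "M", "F", "M", "I", "I"],
    "length": [0.455, 0.350, 0.530, 0.440, 0.330, 0.425],
    "diameter": [0.365, 0.265, 0.420, 0.365, 0.255, 0.300],
    "height": [0.095, 0.090, 0.135, 0.125, 0.080, 0.095],
})

# Assumption: every numeric column is continuous and should be scaled
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = ["sex"]

numeric_pipe = make_pipeline(ColumnSelector(numeric_cols), DFStandardScaler())
categorical_pipe = make_pipeline(ColumnSelector(categorical_cols), DFDummies())

# The union runs both branches on the same input and stacks the results side by side
features = make_union(numeric_pipe, categorical_pipe)
X = features.fit_transform(df)
print(X.shape)  # (6, 6): 3 scaled columns + 3 one-hot columns for 'sex'
```

Note that, by default, the union returns a NumPy array, not a DataFrame, so the column labels are lost at that point. Selecting columns by dtype avoids maintaining a list by hand, but an explicit list is safer when some numeric columns are really categorical codes. scikit-learn 0.20 and later also provide `sklearn.compose.ColumnTransformer`, which may cover the same use case without a custom selector class.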