python dataframe machine-learning scikit-learn pipeline

Scikit-learn machine learning pipeline with passthrough parameters


I have implemented three TransformerMixin subclasses in an attempt to build my own scikit-learn Pipeline. However, I am unable to combine them because the PrepareModel object uses information from the FeatureEngineering object. In particular, consider:

cleaner = DataCleaner()
df_clean = cleaner.fit_transform(df)
engineering = FeatureEngineering()
df_engineered = engineering.fit_transform(df_clean)
modelprep = PrepareModel(engineering.des_features)
X = modelprep.fit_transform(df_engineered)

Note that DataCleaner, FeatureEngineering, and PrepareModel are all subclasses of TransformerMixin.

How would I make a Pipeline with this setup?

from sklearn.pipeline import Pipeline  
full_pipeline = Pipeline([('cleaner', DataCleaner()), 
                          ('engineering', FeatureEngineering()),
                          ('prepare', PrepareModel())])

The issue is that the third step needs des_features from the second step, so this does not work. How can I make it work?


Solution

  • This isn't currently easy to do; it's probably another use case for "metadata routing" (SLEP006).

    In this example, since you own all the transformers, you might be able to hack something together by just attaching an attribute to the output dataset:

    class FeatureEngineering(...):
        ...
    
        def transform(self, X):
            ...
            return_value.metadata = self.des_features
            return return_value
    
    class PrepareModel(...):
        ...
    
        def fit(self, X, y=None):
            self.des_features = X.metadata
            ...
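
For reference, here is a minimal runnable sketch of that idea. It assumes the data is a pandas DataFrame and swaps the ad-hoc `.metadata` attribute for `DataFrame.attrs`, which pandas documents as experimental and which is not guaranteed to survive every DataFrame operation. The transformer bodies, the column names `a`, `b`, and `ratio`, and the `"des_features"` key are hypothetical stand-ins, not the asker's real classes.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DataCleaner(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: drops rows with missing values."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.dropna()


class FeatureEngineering(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: adds one derived column and records its name."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        out = X.copy()
        out["ratio"] = out["a"] / out["b"]
        self.des_features = ["ratio"]
        # Attach the list to the returned frame so the next step can read it.
        out.attrs["des_features"] = list(self.des_features)
        return out


class PrepareModel(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: keeps only the columns named by the previous step."""

    def fit(self, X, y=None):
        # Read the feature list from the incoming frame, not the constructor.
        self.des_features_ = list(X.attrs["des_features"])
        return self

    def transform(self, X):
        return X[self.des_features_].to_numpy()


df = pd.DataFrame({"a": [1.0, 2.0, None, 3.0], "b": [2.0, 4.0, 5.0, 6.0]})

full_pipeline = Pipeline([("cleaner", DataCleaner()),
                          ("engineering", FeatureEngineering()),
                          ("prepare", PrepareModel())])
X = full_pipeline.fit_transform(df)
```

Here `attrs` is set as the last thing before returning, so nothing in `FeatureEngineering.transform` can drop it afterwards. This only works while every step passes DataFrames along; if an intermediate step returned a NumPy array, the attached information would be lost.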