Tags: python, machine-learning, scikit-learn, neuraxle

Is it possible to combine multiple pipelines into a single estimator in Neuraxle or sklearn to create a multi-output classifier and fit it in one go?


I want to create a multi-output classifier. However, the proportion of positive labels varies greatly between outputs: for example, output 1 has 2% positive labels and output 2 has 20%. So I want to split data sampling and model fitting into one stream (sub-pipeline) per output, where each sub-pipeline performs its own oversampling, and the hyperparameters for both the oversampler and the classifier are optimized separately for each output.

For example, suppose that I have

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X = ...  # some input features array here
y = np.array([[0, 1],
              [0, 1],
              [0, 0],
              [1, 0],
              [0, 0]])  # imbalanced label distribution

y_1 = y[:, 0]
y_2 = y[:, 1]


param_grid_shared = {'oversampler__sampling_strategy': [0.2, 0.4, 0.5], 'logit__C': [1, 0.1, 0.01]}

pipeline_output_1 = Pipeline([('oversampler', SMOTE()), ('logit', LogisticRegression())])
grid_1 = GridSearchCV(pipeline_output_1, param_grid_shared)
grid_1.fit(X, y_1)

pipeline_output_2 = Pipeline([('oversampler', SMOTE()), ('logit', LogisticRegression())])
grid_2 = GridSearchCV(pipeline_output_2, param_grid_shared)
grid_2.fit(X, y_2)

And I want to combine them to create something like

multi_pipe = Pipeline([(Something to separate X and y into multiple streams)
                       ((pipe_1, pipeline_output_1),
                       (pipe_2, pipeline_output_2)), # 2 pipeline optimized separately
                       (Evaluate and select hyperparameters for each pipeline separately)
                       (Something to combine output from pipeline 1 and pipeline 2)
                      ]) 

in Neuraxle or Sklearn

MultiOutputClassifier definitely won't fit this case, because the sampling and hyperparameters have to differ per output, and I am not sure where to look for a solution now.


Solution

  • I created an issue with the following idea:

    pipe_1_with_oversampler_1 = Pipeline([
        Oversampler1().assert_has_services(DataRepository), Pipeline1()])
    pipe_2_with_oversampler_2 = Pipeline([
        Oversampler2().assert_has_services(DataRepository), Pipeline2()])
    
    multi_pipe = Pipeline([
        DataPreprocessingStep(),
        # Evaluate and select hyperparameters for each pipeline separately, but within one run, using `multi_pipe.fit(...)`: 
        FeatureUnion([
            AutoML(pipe_1_with_oversampler_1, **automl_args_1),
            AutoML(pipe_2_with_oversampler_2, **automl_args_2)
        ]),
        # And then combine output from pipeline 1 and pipeline 2 using feature union. 
        # Can do preprocessing and postprocessing as well.
        PostprocessingStep(),
    ])
    

    For this to work, the AutoML object would need to be refactored into a regular step, so that it can be used anywhere a step is expected.
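Until something like that exists, the same "one search per output, one `fit` call" idea can be sketched in plain scikit-learn with a small wrapper. This is only a minimal sketch under stated assumptions: `PerOutputSearch` is a hypothetical helper written for this example (it is not part of scikit-learn or Neuraxle), the data is synthetic, and `class_weight` stands in for the oversampler's `sampling_strategy` so that the example depends only on scikit-learn. The imblearn pipelines from the question, each wrapped in its own `GridSearchCV`, should be usable in its place, since the wrapper only calls `fit` and `predict`.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class PerOutputSearch(BaseEstimator):
    """Hypothetical helper: fit one independent search per column of y."""

    def __init__(self, searches):
        # One unfitted search object (e.g. GridSearchCV) per output column.
        self.searches = searches

    def fit(self, X, y):
        y = np.asarray(y)
        if y.ndim != 2 or y.shape[1] != len(self.searches):
            raise ValueError("y needs one column per search object")
        # Each search sees only its own label column, so hyperparameters
        # are selected separately for every output.
        self.fitted_ = [
            clone(search).fit(X, y[:, i])
            for i, search in enumerate(self.searches)
        ]
        return self

    def predict(self, X):
        # Stack the per-output predictions back into shape (n_samples, n_outputs).
        return np.column_stack([s.predict(X) for s in self.fitted_])


# Synthetic data (an assumption for illustration): two outputs with
# different positive rates, roughly 10% and 30%.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.column_stack([X[:, 0] > 1.2, X[:, 1] > 0.5]).astype(int)


def make_search():
    # Stand-in for the question's SMOTE + LogisticRegression pipeline.
    pipe = Pipeline([("scale", StandardScaler()),
                     ("logit", LogisticRegression())])
    grid = {"logit__C": [1, 0.1, 0.01],
            "logit__class_weight": [None, "balanced"]}
    return GridSearchCV(pipe, grid, cv=3)


model = PerOutputSearch([make_search(), make_search()])
model.fit(X, y)          # a single fit call runs both searches
pred = model.predict(X)  # array of shape (200, 2)

for i, search in enumerate(model.fitted_):
    print(f"output {i + 1} best params:", search.best_params_)
```

Each entry of `model.fitted_` is an ordinary fitted `GridSearchCV`, so `best_params_` and `cv_results_` can be inspected per output. The trade-off is that the wrapper itself is not tuned jointly: any preprocessing shared by both outputs would have to sit in front of it.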