Tags: scikit-learn, gridsearchcv, imbalanced-data, imblearn, smote

Not able to feed the combined SMOTE & RandomUnderSampler pipeline into the main pipeline


I am currently working with an imbalanced dataset. To handle the imbalance, I plan to combine SMOTE and ADASYN each with RandomUnderSampler, and also to try undersampling, oversampling, SMOTE, and ADASYN individually (six sampling strategies in total, which I will pass as a parameter in GridSearchCV). I created two pipelines for the combined strategies:

Smote_Under_pipeline = imb_Pipeline([
     ('smote', SMOTE(random_state=rnd_state, n_jobs=-1)),
     ('under', RandomUnderSampler(random_state=rnd_state)),
])

Adasyn_Under_pipeline = imb_Pipeline([
     ('adasyn', ADASYN(random_state=rnd_state, n_jobs=-1)),
     ('under', RandomUnderSampler(random_state=rnd_state)), 
])

My plan is to feed these two pipelines into the main pipeline, which looks like this:

Main_Pipeline = imb_Pipeline([
     ('feature_handler', FeatureTransformer(list(pearson_feature_vector.index))),
     ('imb', Smote_Under_pipeline),
     ('scaler', StandardScaler()),
     ('pca', PCA(n_components=0.99)),
     ('model', LogisticRegression(max_iter=1750)),
])

The FeatureTransformer() is a feature selector class:

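from sklearn.base import BaseEstimator, TransformerMixin

class FeatureTransformer(BaseEstimator, TransformerMixin):
    """Select a fixed subset of columns from a DataFrame."""

    def __init__(self, feature_vector=None):
        self.feature_vector = feature_vector

    def fit(self, X, y=None):
        # Stateless selector: nothing to learn, so y is optional.
        return self

    def transform(self, X):
        # Assumes X supports column selection by label (e.g. a pandas DataFrame).
        return X[self.feature_vector]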

When I use Smote_Under_pipeline or Adasyn_Under_pipeline on its own, it works (sample code below):

dumm_x, dumm_y = Smote_Under_pipeline.fit_resample(X_train, y_train)

But when I try to initialize Main_Pipeline, the interpreter throws an error:

TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or fit_resample. 'Pipeline(steps=[('smote', SMOTE(n_jobs=-1, random_state=42)),
            ('under', RandomUnderSampler(random_state=42))])' implements both)

I am using the pipelines provided by imbalanced-learn.

I do not understand the error. In scikit-learn pipelines, all the intermediate estimators have their own fit() and fit_transform() methods. The imblearn pipelines additionally handle the fit_resample() method, which both Smote_Under_pipeline and Adasyn_Under_pipeline expose. So it should be callable from Main_Pipeline; why is the error being thrown? Both sampling pipelines also expose a fit() method along with fit_resample(). Is this the cause?


Solution

  • To emphasize @glemaitre's comment: the problem is the inner pipeline, which exposes both transform and resampling methods. As the error message says, an intermediate step must implement fit and transform or fit_resample, not both.

    So flattening the pipeline (including the resamplers directly in the main pipeline) seems to be the solution. You may still be able to test the different resampling strategies as hyperparameters by turning off individual steps:

    Main_Pipeline = imb_Pipeline([
         ('feature_handler', FeatureTransformer(list(pearson_feature_vector.index))),
         ('oversamp', None),
         ('undersamp', None),
         ('scaler', StandardScaler()),
         ('pca', PCA(n_components=0.99)),
         ('model', LogisticRegression(max_iter=1750)),
    ])
    
    param_space = {
        'oversamp': [None, SMOTE(...), ADASYN(...), RandomOverSampler(...)],
        'undersamp': [None, RandomUnderSampler(...)],
        ...,
    }
    

    That will give 8 combinations (4 oversampling options × 2 undersampling options): the six you wanted, plus None-None and random over- plus undersampling. But that seems OK to me: it'll be nice to have the comparison to the no-resampling pipeline, and over-undersampling is similar to the synth-undersampling combinations.