python scikit-learn pipeline feature-selection smote

How to use SMOTE and feature selection together in an sklearn pipeline?


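from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif

smt = SMOTE(random_state=0)

# `preprocessor` is defined elsewhere (not shown in the question)
pipeline_rf_smt_fs = Pipeline(
    [
        ('preprocess', preprocessor),
        ('selector', SelectKBest(mutual_info_classif, k=30)),
        ('smote', smt),
        ('rf_classifier', RandomForestClassifier(n_estimators=600, random_state=2021)),
    ]
)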

I am getting the error below:

All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE(random_state=0)' (type <class 'imblearn.over_sampling._smote.SMOTE'>) doesn't

I believe SMOTE has to be applied after the feature selection step. Any help on this would be appreciated.


Solution

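  • This error message comes from scikit-learn's version of Pipeline, not imblearn's. Your code, as is, should not produce it, so you have probably run from sklearn.pipeline import Pipeline somewhere after the imblearn import, which rebinds the name Pipeline to scikit-learn's class. scikit-learn's Pipeline only accepts transformers as intermediate steps, while imblearn's Pipeline also accepts samplers such as SMOTE.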

    From a methodological point of view, I nonetheless find it questionable, in a general setting, to apply a sampler after preprocessing and feature selection. What if the features you select look relevant only because of the imbalance in your dataset? I would prefer to use the sampler as the first step of the pipeline (but this is up to you; either order should run without errors).