python machine-learning scikit-learn multiclass-classification

Up-/downsampling with One vs. rest classifier


I have a data set (tf-idf weighted words) with multiple classes that I am trying to predict. The classes are imbalanced. I would like to use the one-vs-rest approach with some classifiers (e.g. Multinomial Naive Bayes) via the OneVsRestClassifier from sklearn.

Additionally, I would like to use the imbalanced-learn package (most likely one of its combined up- and downsampling methods) to rebalance my data. The usual way to use imbalanced-learn is:

from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)

I now have a data set with roughly the same number of cases for every label. I would then fit the classifier on the resampled data.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
ovr = OneVsRestClassifier(MultinomialNB())
ovr.fit(X_resampled, y_resampled)

But now each of the binary problems is hugely imbalanced when it is fitted, because I have more than 50 labels in total, so every label is positive for only a small fraction of the samples. Right? I imagine that I need to apply the up-/downsampling separately for every label instead of once at the beginning. How can I do the resampling for every label?
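To make the concern concrete, here is a small sketch of the arithmetic. The numbers are assumptions for illustration only (50 labels with exactly 100 samples each after resampling), not taken from the real data set.

```python
import numpy as np

# Hypothetical, perfectly balanced multiclass target after resampling:
# 50 labels with 100 samples each (illustrative numbers only).
n_labels = 50
n_per_label = 100
y_resampled = np.repeat(np.arange(n_labels), n_per_label)

# One-vs-rest turns this into one binary problem per label.
# Here is the binary target for label 0: 1 for "label 0", 0 for "rest".
y_binary = (y_resampled == 0).astype(int)

n_pos = int(y_binary.sum())
n_neg = int(len(y_binary) - n_pos)
ratio = n_neg / n_pos

print(n_pos, n_neg, ratio)  # 100 4900 49.0
```

So even with perfectly balanced labels, each binary sub-problem in this sketch has 49 negatives for every positive.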


Solution

  • As per the discussion in comments, what you want can be done like this:

    from sklearn.naive_bayes import MultinomialNB
    from imblearn.combine import SMOTEENN

    # Note: Pipeline is imported from imblearn, not from sklearn
    from imblearn.pipeline import Pipeline
    from sklearn.multiclass import OneVsRestClassifier

    # This pipeline resamples the data and
    # passes the output to MultinomialNB
    pipe = Pipeline([('sampl', SMOTEENN()),
                     ('clf', MultinomialNB())])

    # OVR binarizes `y` and then fits a separate copy of pipe
    # on each single-label (one-vs-rest) target,
    # one copy per label in the data
    ovr = OneVsRestClassifier(pipe)
    ovr.fit(X, y)
    

    Explanation of code:

    • Step 1: OneVsRestClassifier will binarize y into multiple columns, one for each label, in which that label is positive and all others are negative.

    • Step 2: For each label, OneVsRestClassifier will clone the supplied pipe estimator and fit the clone on X and that label's column of y.

    • Step 3:

      a. Each copy of pipe will get a different binary version of y, which is passed to the SMOTEENN step inside it, so the resampling is done separately for each label to balance that label's positives against its negatives.

      b. The second step of pipe (clf) will then get the balanced dataset for that label, as you wanted.

    • Step 4: At prediction time, the sampling step is skipped, so the data reaches clf as it is. The sklearn Pipeline doesn't handle sampler steps, which is why I used imblearn.pipeline.

    Hope this helps.