I have a data set (tf-idf weighted words) with multiple classes that I am trying to predict. The classes are imbalanced. I would like to use the one-vs-rest classification approach with some classifiers (e.g. Multinomial Naive Bayes), using the OneVsRestClassifier from sklearn.
Additionally, I would like to use the imbalanced-learn package (most likely a combination of over- and undersampling) to balance my data. The usual way of using imbalanced-learn is:
from imblearn.combine import SMOTEENN
# SMOTE oversampling followed by Edited Nearest Neighbours cleaning
smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
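A quick sanity check of the label counts after resampling (sketch):
from collections import Counter
print(Counter(y_resampled))   # counts per label should now be roughly equal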
I now have a data set with roughly the same number of cases for every label. I would then use the classifier on the resampled data:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
ovr = OneVsRestClassifier(MultinomialNB())
ovr.fit(X_resampled, y_resampled)
But: since I have more than 50 labels in total, there is now a huge imbalance for every individual label when it is fitted, right? I imagine I need to apply the up-/downsampling for every label instead of doing it once at the beginning. How can I apply the resampling per label?
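To make the imbalance concrete, here is a back-of-the-envelope illustration (the counts are hypothetical: 50 labels, 100 samples each after global balancing):
n_labels = 50
per_label = 100                      # hypothetical samples per label
pos = per_label                      # positives for a single label
neg = (n_labels - 1) * per_label     # every other label becomes a negative
print(pos, neg, neg / pos)           # 100 4900 49.0 -> roughly a 1:49 split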
As per the discussion in the comments, what you want can be done like this:
from sklearn.naive_bayes import MultinomialNB
from imblearn.combine import SMOTEENN
# Observe how I imported Pipeline from IMBLEARN and not SKLEARN
from imblearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
# This pipeline will resample the data and
# pass the output to MultinomialNB
pipe = Pipeline([('sampl', SMOTEENN()),
                 ('clf', MultinomialNB())])

# OVR will transform the `y` as you know and then pass each label's
# binary data to a different copy of pipe (one copy per label in the data)
ovr = OneVsRestClassifier(pipe)
ovr.fit(X, y)
Explanation of code:
Step 1: OneVsRestClassifier will create multiple columns of y, one for each label, where that label is positive and all others are negative.
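A quick sketch of that binarization (internally OneVsRestClassifier uses a LabelBinarizer, as far as I know; the labels here are made up):
from sklearn.preprocessing import LabelBinarizer

y_demo = ['cat', 'dog', 'bird', 'cat']   # hypothetical labels
Y = LabelBinarizer().fit_transform(y_demo)
print(Y)
# [[0 1 0]    columns are the sorted labels ('bird', 'cat', 'dog');
#  [0 0 1]    each column is one binary positive/negative target
#  [1 0 0]
#  [0 1 0]]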
Step 2: For each label, OneVsRestClassifier will clone the supplied pipe estimator and pass that label's data to it.
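Conceptually, the fitting loop looks like this (a simplified sketch, not scikit-learn's actual implementation; assume Y is the binarized matrix of your real y, as in Step 1):
from sklearn.base import clone

fitted = []
for i in range(Y.shape[1]):
    est = clone(pipe)          # fresh, unfitted copy of the whole pipeline
    est.fit(X, Y[:, i])        # binary target for label i only
    fitted.append(est)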
Step 3:
a. Each copy of pipe will get a different version of y, which is passed to the SMOTEENN inside it, so each copy does a different sampling to balance the classes there.
b. The second part of pipe (clf) will then get that balanced dataset for each label, as you wanted.
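Here is a standalone sketch of what a single copy sees; the ~4% positive rate is a stand-in for one label out of 50 and the data is synthetic:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

# Hypothetical binary target for a single label: ~4% positives
X_i, y_i = make_classification(n_samples=1000, weights=[0.96],
                               random_state=0)
print(Counter(y_i))                      # e.g. Counter({0: 955, 1: 45})
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X_i, y_i)
print(Counter(y_res))                    # roughly balanced classes afterwards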
Step 4: At prediction time, the sampling part is turned off, so the data reaches clf as it is. The sklearn pipeline doesn't handle that part, which is why I used imblearn.pipeline.
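So prediction is just the usual call (X_test here is a hypothetical held-out tf-idf matrix):
# SMOTEENN is skipped outside of fit, so the untouched features
# go straight to each label's MultinomialNB
y_pred = ovr.predict(X_test)
Note that a plain sklearn.pipeline.Pipeline would reject SMOTEENN at fit time anyway, since a sampler has no transform() method.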
Hope this helps.