I have a data set (tf-idf weighted words) with multiple classes that I am trying to predict. The classes are imbalanced. I would like to use the one-vs-rest classification approach with some classifiers (e.g. Multinomial Naive Bayes), using the OneVsRestClassifier from sklearn.
Additionally, I would like to use the imbalanced-learn package (most likely a combination of over- and undersampling) to balance my data. The usual way of using imbalanced-learn is:
from imblearn.combine import SMOTEENN
# SMOTE oversampling followed by Edited Nearest Neighbours cleaning
smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X, y)
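A quick sanity check of the label counts after resampling (sketch):
from collections import Counter
print(Counter(y_resampled))   # counts per label should now be roughly equal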
I now have a data set with roughly the same number of cases for every label. I would then use the classifier on the resampled data:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
ovr = OneVsRestClassifier(MultinomialNB())
ovr.fit(X_resampled, y_resampled)
But: since I have more than 50 labels in total, there is now a huge imbalance for every individual label when it is fitted, right? I imagine I need to apply the up-/downsampling for every label instead of doing it once at the beginning. How can I apply the resampling per label?
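To make the imbalance concrete, here is a back-of-the-envelope illustration (the counts are hypothetical: 50 labels, 100 samples each after global balancing):
n_labels = 50
per_label = 100                      # hypothetical samples per label
pos = per_label                      # positives for a single label
neg = (n_labels - 1) * per_label     # every other label becomes a negative
print(pos, neg, neg / pos)           # 100 4900 49.0 -> roughly a 1:49 split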
As per the discussion in the comments, what you want can be done like this:
from sklearn.naive_bayes import MultinomialNB
from imblearn.combine import SMOTEENN
# Observe how I imported Pipeline from IMBLEARN and not SKLEARN
from imblearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
# This pipeline will resample the data and
# pass the output to MultinomialNB
pipe = Pipeline([('sampl', SMOTEENN()),
                 ('clf', MultinomialNB())])

# OVR will transform the `y` as you know and then pass each label's
# binary data to a different copy of pipe (one copy per label in the data)
ovr = OneVsRestClassifier(pipe)
ovr.fit(X, y)
Explanation of code:
Step 1: OneVsRestClassifier will create multiple columns of y, one for each label, where that label is positive and all others are negative.
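A quick sketch of that binarization (internally OneVsRestClassifier uses a LabelBinarizer, as far as I know; the labels here are made up):
from sklearn.preprocessing import LabelBinarizer

y_demo = ['cat', 'dog', 'bird', 'cat']   # hypothetical labels
Y = LabelBinarizer().fit_transform(y_demo)
print(Y)
# [[0 1 0]    columns are the sorted labels ('bird', 'cat', 'dog');
#  [0 0 1]    each column is one binary positive/negative target
#  [1 0 0]
#  [0 1 0]]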
Step 2: For each label, OneVsRestClassifier will clone the supplied pipe estimator and pass that label's data to it.
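Conceptually, the fitting loop looks like this (a simplified sketch, not scikit-learn's actual implementation; assume Y is the binarized matrix of your real y, as in Step 1):
from sklearn.base import clone

fitted = []
for i in range(Y.shape[1]):
    est = clone(pipe)          # fresh, unfitted copy of the whole pipeline
    est.fit(X, Y[:, i])        # binary target for label i only
    fitted.append(est)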
Step 3:
a. Each copy of pipe will get a different version of y, which is passed to the SMOTEENN inside it, so each copy does a different sampling to balance the classes there.
b. The second part of pipe (clf) will then get that balanced dataset for each label, as you wanted.
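Here is a standalone sketch of what a single copy sees; the ~4% positive rate is a stand-in for one label out of 50 and the data is synthetic:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

# Hypothetical binary target for a single label: ~4% positives
X_i, y_i = make_classification(n_samples=1000, weights=[0.96],
                               random_state=0)
print(Counter(y_i))                      # e.g. Counter({0: 955, 1: 45})
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X_i, y_i)
print(Counter(y_res))                    # roughly balanced classes afterwards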
Step 4: At prediction time, the sampling part is turned off, so the data reaches clf as it is. The sklearn pipeline doesn't handle that part, which is why I used imblearn.pipeline.
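So prediction is just the usual call (X_test here is a hypothetical held-out tf-idf matrix):
# SMOTEENN is skipped outside of fit, so the untouched features
# go straight to each label's MultinomialNB
y_pred = ovr.predict(X_test)
Note that a plain sklearn.pipeline.Pipeline would reject SMOTEENN at fit time anyway, since a sampler has no transform() method.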
Hope this helps.