When using sklearn's LogisticRegression for binary classification on an imbalanced training dataset (e.g., 85% positive class vs. 15% negative class), is there a difference between setting the class_weight argument to 'balanced' and setting it to {0:0.15, 1:0.85}? Based on the documentation, it appears to me that using the 'balanced' argument does the same thing as providing the dictionary.
class_weight
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
Yes, it amounts to the same weighting. With class_weight='balanced' you don't need to pass the exact numbers yourself: the weights are computed automatically from the class frequencies, inversely proportional, so the minority class receives the larger weight.
You can find a more extensive explanation at this link:
https://scikit-learn.org/dev/glossary.html#term-class-weight
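To see exactly what 'balanced' produces for the 85/15 split in the question, you can use sklearn's compute_class_weight utility (the same formula quoted from the docs). Note which class gets the larger weight: 'balanced' upweights the minority class, so a proportional manual dict for a 15%/85% split over classes 0/1 would be {0: 0.85, 1: 0.15} scaled up, not {0: 0.15, 1: 0.85}. A minimal sketch with simulated labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Simulated labels with the question's imbalance: 15% class 0, 85% class 1
y = np.array([0] * 15 + [1] * 85)

# n_samples / (n_classes * np.bincount(y)) = [100/(2*15), 100/(2*85)]
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(weights)  # → [3.33333333 0.58823529]
```

The minority class (0) gets weight ≈3.33 and the majority class (1) gets ≈0.59, i.e., the weights are in the ratio 0.85 : 0.15.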
To confirm that they behave the same, I ran the following experiment:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# iris has 3 perfectly balanced classes (50 samples each),
# so class_weight='balanced' computes a weight of 1.0 for every class
X, y = load_iris(return_X_y=True)

clf_balanced = LogisticRegression(class_weight='balanced', random_state=0).fit(X, y)
# classes missing from the dict (here class 2) default to weight 1
clf_custom = LogisticRegression(class_weight={0: 0.5, 1: 0.5}, random_state=0).fit(X, y)
clf_none = LogisticRegression(class_weight=None, random_state=0).fit(X, y)

print('Balanced:', clf_balanced.score(X, y))
print('Custom:', clf_custom.score(X, y))
print('None:', clf_none.score(X, y))
And the output is:
Balanced: 0.9733333333333334
Custom: 0.9733333333333334
None: 0.9733333333333334
So, empirically, the settings behave the same here (iris is balanced, so all three configurations reduce to essentially uniform weighting, which is why the scores match exactly).
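Since iris is balanced, a stronger check is a binary imbalanced problem like the one in the question. The sketch below (assumptions: make_classification to simulate a roughly 15%/85% split, and max_iter raised to ensure convergence) passes the exact 'balanced' values as a manual dict; the two fits then produce identical coefficients, not just identical scores:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced binary problem roughly matching the question (~15% / 85%)
X, y = make_classification(n_samples=1000, weights=[0.15, 0.85], random_state=0)

# Build a manual dict from the exact weights 'balanced' would compute
w = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
manual = {0: w[0], 1: w[1]}

clf_a = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=0).fit(X, y)
clf_b = LogisticRegression(class_weight=manual, max_iter=1000, random_state=0).fit(X, y)

# Same weights, same deterministic solver -> identical models
print(np.allclose(clf_a.coef_, clf_b.coef_))  # → True
```

Note the absolute scale of the weights matters, not just their ratio, because the weighted loss is traded off against the fixed regularization strength C; passing a rescaled dict such as {0: 0.85, 1: 0.15} gives the same ratio but a different effective regularization.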