When using sklearn's LogisticRegression for binary classification on an imbalanced training dataset (e.g., 85% positive class vs. 15% negative class), is there a difference between setting the class_weight argument to 'balanced' and setting it to {0:0.15, 1:0.85}? Based on the documentation, it appears to me that using the 'balanced' argument does the same thing as providing the dictionary.
class_weight
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
Yes, it amounts to the same weighting. With class_weight='balanced' you don't need to pass the exact numbers yourself: the weights are computed automatically from the class frequencies, inversely proportional, so the minority class receives the larger weight.
You can find a more extensive explanation at this link:
https://scikit-learn.org/dev/glossary.html#term-class-weight
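To see exactly what 'balanced' produces for the 85/15 split in the question, you can use sklearn's compute_class_weight utility (the same formula quoted from the docs). Note which class gets the larger weight: 'balanced' upweights the minority class, so a proportional manual dict for a 15%/85% split over classes 0/1 would be {0: 0.85, 1: 0.15} scaled up, not {0: 0.15, 1: 0.85}. A minimal sketch with simulated labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Simulated labels with the question's imbalance: 15% class 0, 85% class 1
y = np.array([0] * 15 + [1] * 85)

# n_samples / (n_classes * np.bincount(y)) = [100/(2*15), 100/(2*85)]
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(weights)  # → [3.33333333 0.58823529]
```

The minority class (0) gets weight ≈3.33 and the majority class (1) gets ≈0.59, i.e., the weights are in the ratio 0.85 : 0.15.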
To confirm that they behave the same, I ran the following experiment:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# iris has 3 perfectly balanced classes (50 samples each),
# so class_weight='balanced' computes a weight of 1.0 for every class
X, y = load_iris(return_X_y=True)

clf_balanced = LogisticRegression(class_weight='balanced', random_state=0).fit(X, y)
# classes missing from the dict (here class 2) default to weight 1
clf_custom = LogisticRegression(class_weight={0: 0.5, 1: 0.5}, random_state=0).fit(X, y)
clf_none = LogisticRegression(class_weight=None, random_state=0).fit(X, y)

print('Balanced:', clf_balanced.score(X, y))
print('Custom:', clf_custom.score(X, y))
print('None:', clf_none.score(X, y))
And the output is:
Balanced: 0.9733333333333334
Custom: 0.9733333333333334
None: 0.9733333333333334
So, empirically, the settings behave the same here (iris is balanced, so all three configurations reduce to essentially uniform weighting, which is why the scores match exactly).
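Since iris is balanced, a stronger check is a binary imbalanced problem like the one in the question. The sketch below (assumptions: make_classification to simulate a roughly 15%/85% split, and max_iter raised to ensure convergence) passes the exact 'balanced' values as a manual dict; the two fits then produce identical coefficients, not just identical scores:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced binary problem roughly matching the question (~15% / 85%)
X, y = make_classification(n_samples=1000, weights=[0.15, 0.85], random_state=0)

# Build a manual dict from the exact weights 'balanced' would compute
w = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
manual = {0: w[0], 1: w[1]}

clf_a = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=0).fit(X, y)
clf_b = LogisticRegression(class_weight=manual, max_iter=1000, random_state=0).fit(X, y)

# Same weights, same deterministic solver -> identical models
print(np.allclose(clf_a.coef_, clf_b.coef_))  # → True
```

Note the absolute scale of the weights matters, not just their ratio, because the weighted loss is traded off against the fixed regularization strength C; passing a rescaled dict such as {0: 0.85, 1: 0.15} gives the same ratio but a different effective regularization.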