Tags: scikit-learn, logistic-regression, imbalanced-data

Logistic Regression - class_weight balanced vs dict argument


When using the sklearn LogisticRegression function for binary classification of an imbalanced training dataset (e.g., 85% pos class vs 15% neg class), is there a difference between setting the class_weight argument to 'balanced' and setting it to {0:0.15, 1:0.85}? Based on the documentation, it appears to me that using the 'balanced' argument will do the same thing as providing the dictionary.

class_weight

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).


Solution

  • Yes, it is effectively the same. With the class_weight='balanced' parameter you don't need to pass the exact numbers; the weights are computed automatically from the class frequencies, and for an 85/15 split they come out proportional to {0: 0.15, 1: 0.85} (the same ratio, just a different overall scale).

    You can see a more extensive explanation in this link:

    https://scikit-learn.org/dev/glossary.html#term-class-weight
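
    To make the formula concrete for the 85/15 split in the question, you can ask scikit-learn directly what 'balanced' computes, using sklearn.utils.class_weight.compute_class_weight (the 850/150 sample counts below are just an illustration of that split):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# illustrative 85/15 split: 850 samples of class 0, 150 of class 1
y = np.array([0] * 850 + [1] * 150)

# n_samples / (n_classes * np.bincount(y))
w = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(w)                           # [0.58823529 3.33333333]

# same ratio as the dict {0: 0.15, 1: 0.85}
print(w[1] / w[0], 0.85 / 0.15)    # both ≈ 5.6667
```

    So 'balanced' and {0:0.15, 1:0.85} encode the same relative weighting; only the overall scale differs.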

    To confirm the similarity of the following settings:

    • class_weight = 'balanced'
    • class_weight = {0:0.5, 1:0.5}
    • class_weight = None
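
    A side note on the experiment below: iris has three equally frequent classes (50 samples each), so class_weight='balanced' resolves to a weight of 1.0 for every class, and the dict {0:0.5, 1:0.5} leaves the missing class 2 at its default weight of 1. A quick check of the first point:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils.class_weight import compute_class_weight

_, y = load_iris(return_X_y=True)

# equal class frequencies => 'balanced' assigns every class a weight of 1.0
w_iris = compute_class_weight('balanced', classes=np.unique(y), y=y)
print(w_iris)   # [1. 1. 1.]
```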

    I ran this experiment:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    
    # iris: three classes of 50 samples each, i.e. perfectly balanced
    X, y = load_iris(return_X_y=True)
    clf_balanced = LogisticRegression(class_weight='balanced', random_state=0).fit(X, y)
    clf_custom = LogisticRegression(class_weight={0: 0.5, 1: 0.5}, random_state=0).fit(X, y)
    clf_none = LogisticRegression(class_weight=None, random_state=0).fit(X, y)
    
    print('Balanced:', clf_balanced.score(X, y))
    print('Custom:  ', clf_custom.score(X, y))
    print('None:    ', clf_none.score(X, y))
    

    And the output is:

    Balanced: 0.9733333333333334
    Custom:   0.9733333333333334
    None:     0.9733333333333334
    

    So, empirically, the three settings give the same result here, which is what we expect on a dataset with equally frequent classes.
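
    One caveat to "they are the same": the 'balanced' weights match {0:0.15, 1:0.85} only up to a constant factor, and scaling all class weights by a constant has the same effect as scaling the regularization parameter C by that constant. The sketch below (my own synthetic imbalanced dataset, not part of the original answer) shows that once C is compensated accordingly, the two parameterizations fit essentially identical models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
# synthetic binary data with an exact 85/15 split (assumption: 2 Gaussian features)
X = np.vstack([rng.normal(0.0, 1.0, size=(850, 2)),
               rng.normal(1.5, 1.0, size=(150, 2))])
y = np.array([0] * 850 + [1] * 150)

# 'balanced' weights are {0: 0.15, 1: 0.85} times a constant factor
w_bal = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
factor = w_bal[0] / 0.15              # identical to w_bal[1] / 0.85

clf_bal = LogisticRegression(class_weight='balanced',
                             tol=1e-10, max_iter=10_000).fit(X, y)
# multiplying C by that factor makes the two objectives identical
clf_dict = LogisticRegression(class_weight={0: 0.15, 1: 0.85}, C=factor,
                              tol=1e-10, max_iter=10_000).fit(X, y)

print(np.allclose(clf_bal.coef_, clf_dict.coef_, atol=1e-4))
```

    Without the C compensation the two fits differ slightly because the dict's smaller weights shrink the data-fit term relative to the L2 penalty; in practice the difference is usually negligible, but it is worth knowing about.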