While working on an imbalanced dataset, I came across an interesting question about LogisticRegression in scikit-learn.
For the class_weight parameter, if I pass {1: 0.5, 0: 0.5} I get a different outcome than with {1: 1, 0: 1}, even though the two dictionaries express the same relative weights.
Here is what I got:
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(1)

def sigmoid(x):
    return 1 / (np.exp(-x) + 1)

# simulate labels from a known logistic model
x1 = np.random.normal(0, 4, 100000)
x2 = np.random.normal(0, 1, 100000)
X = np.array([x1, x2]).T
proba = sigmoid(0.1 + 2*x1 + 3*x2)
y = np.random.binomial(1, proba)

# same relative class weights, different absolute scale
lr1 = LogisticRegression(C=1, class_weight={1: 0.5, 0: 0.5}).fit(X, y)
print(lr1.score(X, y))  # 0.93656

lr2 = LogisticRegression(C=1, class_weight={0: 1, 1: 1}).fit(X, y)
print(lr2.score(X, y))  # 0.93653
Could anyone explain how the class_weight parameter actually works, why this difference happens, and how to use class_weight properly?

The way class_weight is implemented is that it is turned into per-sample weights (sample_weight), which in turn multiply each sample's term in the loss. Unfortunately, they do not affect the regulariser, so the regulariser's relative strength changes when you rescale the weights.
lr2 = LogisticRegression(C=0.5, class_weight = {0:1, 1:1}).fit(X, y)
will give you the desired
print(lr2.score(X,y)) # 0.93656
and analogously
lr2 = LogisticRegression(C=0.25, class_weight = {0:2, 1:2}).fit(X, y)
print(lr2.score(X,y)) # 0.93656
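To see that class_weight really does act through per-sample weights, here is a minimal check; the weights {0: 1, 1: 2} are arbitrary, chosen only for illustration, and X, y are reused from the question:

import numpy as np
from sklearn.linear_model import LogisticRegression

# class weights of 1 for class 0 and 2 for class 1 ...
lr_cw = LogisticRegression(C=1, class_weight={0: 1, 1: 2}).fit(X, y)

# ... should match explicit per-sample weights of 2 for positives, 1 for negatives
sw = np.where(y == 1, 2.0, 1.0)
lr_sw = LogisticRegression(C=1).fit(X, y, sample_weight=sw)

print(np.allclose(lr_cw.coef_, lr_sw.coef_))  # expect True (up to solver tolerance)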
So in general, 1/C (the regularisation strength) should scale together with the total reweighting of your samples: if you multiply all the class weights by k, divide C by k. This is because, roughly, the objective is implemented as

LOSS := 1/C * ||w||^2 + SUM_i sample_weight_i * loss(pred(x_i), y_i)
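As a quick sanity check of that rule (a sketch reusing X and y from above; the factor k = 4 is arbitrary), multiplying all class weights by k while dividing C by k should leave the fitted coefficients essentially unchanged:

import numpy as np
from sklearn.linear_model import LogisticRegression

k = 4
base = LogisticRegression(C=1, class_weight={0: 1, 1: 1}).fit(X, y)
rescaled = LogisticRegression(C=1 / k, class_weight={0: k, 1: k}).fit(X, y)

# the two objectives are identical, so the solutions should agree up to solver tolerance
print(np.allclose(base.coef_, rescaled.coef_, atol=1e-4))  # expect True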