python logistic-regression statsmodels weighted

How to use weights in a logistic regression

I want to calculate (weighted) logistic regression in Python. The weights were calculated to adjust the distribution of the sample regarding the population. However, the results don´t change if I use weights.

import numpy as np
import pandas as pd  
import statsmodels.api as sm

The data looks like this. The target variable is VISIT. The features are all other variables except WEIGHT_both (which is the weight I want to use).

df.head() 

WEIGHT_both VISIT   Q19_1   Q19_2   Q19_3   Q19_4   Q19_5   Q19_6   Q19_7   Q19_8   ... Q19_23  Q19_24  Q19_25  Q19_26  Q19_27  Q19_28  Q19_29  Q19_30  Q19_31  Q19_32
0   0.022320    1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 4.0 4.0 1.0 1.0 1.0 1.0 2.0 3.0 3.0 2.0
1   0.027502    1.0 3.0 2.0 2.0 2.0 3.0 4.0 3.0 2.0 ... 3.0 2.0 2.0 2.0 2.0 4.0 2.0 4.0 2.0 2.0
2   0.022320    1.0 2.0 3.0 1.0 4.0 3.0 3.0 3.0 2.0 ... 3.0 3.0 3.0 2.0 2.0 1.0 2.0 2.0 1.0 1.0
3   0.084499    1.0 2.0 2.0 2.0 2.0 2.0 4.0 1.0 1.0 ... 2.0 2.0 1.0 1.0 1.0 2.0 1.0 2.0 1.0 1.0
4   0.022320    1.0 3.0 4.0 3.0 3.0 3.0 2.0 3.0 3.0 ... 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0

Without the weight the model looks like this:

X = df.drop('WEIGHT_both', axis = 1)
X = X.drop('VISIT', axis = 1)
X = sm.add_constant(X)
w = = df['WEIGHT_both']
Y= df['VISIT']

fit = sm.Logit(Y, X).fit()

fit.summary()


Dep. Variable:  VISIT   No. Observations:   7971
Model:  Logit   Df Residuals:   7938
Method: MLE Df Model:   32
Date:   Sun, 05 Jul 2020    Pseudo R-squ.:  0.2485
Time:   16:41:12    Log-Likelihood: -3441.2
converged:  True    LL-Null:    -4578.8
Covariance Type:    nonrobust   LLR p-value:    0.000
coef    std err z   P>|z|   [0.025  0.975]
const   3.8098  0.131   29.126  0.000   3.553   4.066
Q19_1   -0.1116 0.063   -1.772  0.076   -0.235  0.012
Q19_2   -0.2718 0.061   -4.483  0.000   -0.391  -0.153
Q19_3   -0.2145 0.061   -3.519  0.000   -0.334  -0.095

With the sample weight the result looks like this (no change):

fit2 = sm.Logit(Y, X, sample_weight = w).fit() 
# same thing if I use class_weight

fit2.summary()


Dep. Variable:  VISIT   No. Observations:   7971
Model:  Logit   Df Residuals:   7938
Method: MLE Df Model:   32
Date:   Sun, 05 Jul 2020    Pseudo R-squ.:  0.2485
Time:   16:41:12    Log-Likelihood: -3441.2
converged:  True    LL-Null:    -4578.8
Covariance Type:    nonrobust   LLR p-value:    0.000
coef    std err z   P>|z|   [0.025  0.975]
const   3.8098  0.131   29.126  0.000   3.553   4.066
Q19_1   -0.1116 0.063   -1.772  0.076   -0.235  0.012
Q19_2   -0.2718 0.061   -4.483  0.000   -0.391  -0.153
Q19_3   -0.2145 0.061   -3.519  0.000   -0.334  -0.095

I calculated the regression with other Programms (e.g. SPSS, R). The weighted result has to be different.

Here is an example (R-Code).

Without weights (same result as with Python code):

fit = glm(VISIT~., data = df[ -c(1)] , family = "binomial")

summary(fit)

Call:
glm(formula = VISIT ~ ., family = "binomial", data = df[-c(1)])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.1216  -0.6984   0.3722   0.6838   2.1083  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.80983    0.13080  29.126  < 2e-16 ***
Q19_1       -0.11158    0.06296  -1.772 0.076374 .  
Q19_2       -0.27176    0.06062  -4.483 7.36e-06 ***
Q19_3       -0.21451    0.06096  -3.519 0.000434 ***
Q19_4        0.22417    0.05163   4.342 1.41e-05 ***

With weights:

fit2 = glm(VISIT~., data = df[ -c(1)],  weights = df$WEIGHT_both, family = "binomial")


summary(fit2)

Call:
glm(formula = VISIT ~ ., family = "binomial", data = df[-c(1)], 
    weights = df$WEIGHT_both)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4894  -0.3315   0.1619   0.2898   3.7878  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.950e-01  1.821e-01   2.718 0.006568 ** 
Q19_1       -6.497e-02  8.712e-02  -0.746 0.455835    
Q19_2       -1.720e-02  8.707e-02  -0.198 0.843362    
Q19_3       -1.114e-01  8.436e-02  -1.320 0.186743    
Q19_4        1.898e-02  7.095e-02   0.268 0.789066

Any idea how to use weights in a logistic regression?

Solution

I think one way is to use smf.glm() where you can provide the weights as freq_weights , you should check this section on weighted glm and see whether it is what you want to achieve. Below I provide an example where it is used in the same way as weights= in R :

import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm

data = sns.load_dataset("iris") 
data['species'] = (data['species'] == "versicolor").astype(int)

fit = smf.glm("species ~ sepal_length + sepal_width + petal_length + petal_width",
              family=sm.families.Binomial(),data=data).fit()

fit.summary()

    coef         std err    z   P>|z|   [0.025  0.975]
Intercept        7.3785 2.499   2.952   0.003   2.480   12.277
sepal_length    -0.2454 0.650   -0.378  0.706   -1.518  1.028
sepal_width     -2.7966 0.784   -3.569  0.000   -4.332  -1.261
petal_length     1.3136 0.684   1.921   0.055   -0.027  2.654
petal_width     -2.7783 1.173   -2.368  0.018   -5.078  -0.479

Now provide weights:

wts = np.repeat(np.arange(1,6),30)
fit = smf.glm("species ~ sepal_length + sepal_width + petal_length + petal_width",
              family=sm.families.Binomial(),data=data,freq_weights=wts).fit()
fit.summary()

    coef         std err    z   P>|z|   [0.025  0.975]
Intercept        8.7146 1.444   6.036   0.000   5.885   11.544
sepal_length    -0.2053 0.359   -0.571  0.568   -0.910  0.499
sepal_width     -2.7293 0.454   -6.012  0.000   -3.619  -1.839
petal_length     0.8920 0.365   2.440   0.015   0.176   1.608
petal_width     -2.8420 0.622   -4.570  0.000   -4.061  -1.623

So in R you have the unweighted:

glm(Species ~ .,data=data,family=binomial)

Call:  glm(formula = Species ~ ., family = binomial, data = data)

Coefficients:
 (Intercept)  Sepal.Length   Sepal.Width  Petal.Length   Petal.Width  
      7.3785       -0.2454       -2.7966        1.3136       -2.7783  

Degrees of Freedom: 149 Total (i.e. Null);  145 Residual
Null Deviance:      191 
Residual Deviance: 145.1    AIC: 155.1

And the weighted model

glm(Species ~ .,data=data,family=binomial,weights=rep(1:5,each=30))

Call:  glm(formula = Species ~ ., family = binomial, data = data, weights = rep(1:5, 
    each = 30))

Coefficients:
 (Intercept)  Sepal.Length   Sepal.Width  Petal.Length   Petal.Width  
      8.7146       -0.2053       -2.7293        0.8920       -2.8420  

Degrees of Freedom: 149 Total (i.e. Null);  145 Residual
Null Deviance:      572.9 
Residual Deviance: 448.9    AIC: 458.9