Search code examples
pythonlogistic-regressionstatsmodelsweighted

How to use weights in a logistic regression


I want to calculate (weighted) logistic regression in Python. The weights were calculated to adjust the distribution of the sample regarding the population. However, the results don´t change if I use weights.

import numpy as np
import pandas as pd  
import statsmodels.api as sm  

The data looks like this. The target variable is VISIT. The features are all other variables except WEIGHT_both (which is the weight I want to use).

df.head() 

WEIGHT_both VISIT   Q19_1   Q19_2   Q19_3   Q19_4   Q19_5   Q19_6   Q19_7   Q19_8   ... Q19_23  Q19_24  Q19_25  Q19_26  Q19_27  Q19_28  Q19_29  Q19_30  Q19_31  Q19_32
0   0.022320    1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 4.0 4.0 1.0 1.0 1.0 1.0 2.0 3.0 3.0 2.0
1   0.027502    1.0 3.0 2.0 2.0 2.0 3.0 4.0 3.0 2.0 ... 3.0 2.0 2.0 2.0 2.0 4.0 2.0 4.0 2.0 2.0
2   0.022320    1.0 2.0 3.0 1.0 4.0 3.0 3.0 3.0 2.0 ... 3.0 3.0 3.0 2.0 2.0 1.0 2.0 2.0 1.0 1.0
3   0.084499    1.0 2.0 2.0 2.0 2.0 2.0 4.0 1.0 1.0 ... 2.0 2.0 1.0 1.0 1.0 2.0 1.0 2.0 1.0 1.0
4   0.022320    1.0 3.0 4.0 3.0 3.0 3.0 2.0 3.0 3.0 ... 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0

Without the weight the model looks like this:

X = df.drop('WEIGHT_both', axis = 1)
X = X.drop('VISIT', axis = 1)
X = sm.add_constant(X)
w = = df['WEIGHT_both']
Y= df['VISIT']

fit = sm.Logit(Y, X).fit()

fit.summary()


Dep. Variable:  VISIT   No. Observations:   7971
Model:  Logit   Df Residuals:   7938
Method: MLE Df Model:   32
Date:   Sun, 05 Jul 2020    Pseudo R-squ.:  0.2485
Time:   16:41:12    Log-Likelihood: -3441.2
converged:  True    LL-Null:    -4578.8
Covariance Type:    nonrobust   LLR p-value:    0.000
coef    std err z   P>|z|   [0.025  0.975]
const   3.8098  0.131   29.126  0.000   3.553   4.066
Q19_1   -0.1116 0.063   -1.772  0.076   -0.235  0.012
Q19_2   -0.2718 0.061   -4.483  0.000   -0.391  -0.153
Q19_3   -0.2145 0.061   -3.519  0.000   -0.334  -0.095

With the sample weight the result looks like this (no change):

fit2 = sm.Logit(Y, X, sample_weight = w).fit() 
# same thing if I use class_weight

fit2.summary()


Dep. Variable:  VISIT   No. Observations:   7971
Model:  Logit   Df Residuals:   7938
Method: MLE Df Model:   32
Date:   Sun, 05 Jul 2020    Pseudo R-squ.:  0.2485
Time:   16:41:12    Log-Likelihood: -3441.2
converged:  True    LL-Null:    -4578.8
Covariance Type:    nonrobust   LLR p-value:    0.000
coef    std err z   P>|z|   [0.025  0.975]
const   3.8098  0.131   29.126  0.000   3.553   4.066
Q19_1   -0.1116 0.063   -1.772  0.076   -0.235  0.012
Q19_2   -0.2718 0.061   -4.483  0.000   -0.391  -0.153
Q19_3   -0.2145 0.061   -3.519  0.000   -0.334  -0.095

I calculated the regression with other Programms (e.g. SPSS, R). The weighted result has to be different.

Here is an example (R-Code).

Without weights (same result as with Python code):

fit = glm(VISIT~., data = df[ -c(1)] , family = "binomial")

summary(fit)

Call:
glm(formula = VISIT ~ ., family = "binomial", data = df[-c(1)])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.1216  -0.6984   0.3722   0.6838   2.1083  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.80983    0.13080  29.126  < 2e-16 ***
Q19_1       -0.11158    0.06296  -1.772 0.076374 .  
Q19_2       -0.27176    0.06062  -4.483 7.36e-06 ***
Q19_3       -0.21451    0.06096  -3.519 0.000434 ***
Q19_4        0.22417    0.05163   4.342 1.41e-05 ***

With weights:

fit2 = glm(VISIT~., data = df[ -c(1)],  weights = df$WEIGHT_both, family = "binomial")


summary(fit2)

Call:
glm(formula = VISIT ~ ., family = "binomial", data = df[-c(1)], 
    weights = df$WEIGHT_both)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4894  -0.3315   0.1619   0.2898   3.7878  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.950e-01  1.821e-01   2.718 0.006568 ** 
Q19_1       -6.497e-02  8.712e-02  -0.746 0.455835    
Q19_2       -1.720e-02  8.707e-02  -0.198 0.843362    
Q19_3       -1.114e-01  8.436e-02  -1.320 0.186743    
Q19_4        1.898e-02  7.095e-02   0.268 0.789066   

Any idea how to use weights in a logistic regression?


Solution

  • I think one way is to use smf.glm() where you can provide the weights as freq_weights , you should check this section on weighted glm and see whether it is what you want to achieve. Below I provide an example where it is used in the same way as weights= in R :

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import statsmodels.formula.api as smf
    import statsmodels.api as sm
    
    data = sns.load_dataset("iris") 
    data['species'] = (data['species'] == "versicolor").astype(int)
    
    fit = smf.glm("species ~ sepal_length + sepal_width + petal_length + petal_width",
                  family=sm.families.Binomial(),data=data).fit()
    
    fit.summary()
    
        coef         std err    z   P>|z|   [0.025  0.975]
    Intercept        7.3785 2.499   2.952   0.003   2.480   12.277
    sepal_length    -0.2454 0.650   -0.378  0.706   -1.518  1.028
    sepal_width     -2.7966 0.784   -3.569  0.000   -4.332  -1.261
    petal_length     1.3136 0.684   1.921   0.055   -0.027  2.654
    petal_width     -2.7783 1.173   -2.368  0.018   -5.078  -0.479
    

    Now provide weights:

    wts = np.repeat(np.arange(1,6),30)
    fit = smf.glm("species ~ sepal_length + sepal_width + petal_length + petal_width",
                  family=sm.families.Binomial(),data=data,freq_weights=wts).fit()
    fit.summary()
    
        coef         std err    z   P>|z|   [0.025  0.975]
    Intercept        8.7146 1.444   6.036   0.000   5.885   11.544
    sepal_length    -0.2053 0.359   -0.571  0.568   -0.910  0.499
    sepal_width     -2.7293 0.454   -6.012  0.000   -3.619  -1.839
    petal_length     0.8920 0.365   2.440   0.015   0.176   1.608
    petal_width     -2.8420 0.622   -4.570  0.000   -4.061  -1.623
    

    So in R you have the unweighted:

    glm(Species ~ .,data=data,family=binomial)
    
    Call:  glm(formula = Species ~ ., family = binomial, data = data)
    
    Coefficients:
     (Intercept)  Sepal.Length   Sepal.Width  Petal.Length   Petal.Width  
          7.3785       -0.2454       -2.7966        1.3136       -2.7783  
    
    Degrees of Freedom: 149 Total (i.e. Null);  145 Residual
    Null Deviance:      191 
    Residual Deviance: 145.1    AIC: 155.1
    

    And the weighted model

    glm(Species ~ .,data=data,family=binomial,weights=rep(1:5,each=30))
    
    Call:  glm(formula = Species ~ ., family = binomial, data = data, weights = rep(1:5, 
        each = 30))
    
    Coefficients:
     (Intercept)  Sepal.Length   Sepal.Width  Petal.Length   Petal.Width  
          8.7146       -0.2053       -2.7293        0.8920       -2.8420  
    
    Degrees of Freedom: 149 Total (i.e. Null);  145 Residual
    Null Deviance:      572.9 
    Residual Deviance: 448.9    AIC: 458.9