Search code examples
pythonlogistic-regressionstatsmodelsglm

Generalized logistic regression Python: how to correctly define binary independent variable?


I am using weighted Generalized linear models (statsmodels) for classification:

import statsmodels.api as sm
model= sm.GLM(y, x_with_intercept, max_iter=500, random_state=42, family=sm.families.Binomial(),freq_weights=weights)

One of the variables in x_with_intercept is binary. I thought there would be no issue in including a binary variable, though the output of the model produces extremely high standard errors, in comparison to other variables.

enter image description here

Is there a way to account for the binary variable correctly in the model definition?


Solution

  • Looking at your coefficients, it is weird that even the intercept has a high standard error. Most likely your binary variable has too little positive cases and is confounded with your other variables, so it has a problem estimating that.

    For example, if we have a normal binary predictor, no problem:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    
    np.random.seed(123)
    
    x = pd.DataFrame({'b1':np.random.binomial(1,0.5,100),
    'b2':np.random.binomial(1,0.5,100),
    'c1':np.random.uniform(0,1,100),
    'c2':np.random.normal(0,1,100)})
    
    y = np.random.binomial(1,0.5,100)
    wt = np.random.poisson(10,100)
    
    model= sm.GLM(y, sm.add_constant(x), max_iter=500, 
    random_state=42, family=sm.families.Binomial(),
    freq_weights=wt)
    
    results = model.fit()
    

    Looks normal:

    ==============================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const         -0.3858      0.149     -2.592      0.010      -0.677      -0.094
    b1             0.1965      0.132      1.490      0.136      -0.062       0.455
    b2             0.6114      0.140      4.362      0.000       0.337       0.886
    c1            -0.5621      0.214     -2.621      0.009      -0.982      -0.142
    c2            -0.5108      0.072     -7.113      0.000      -0.651      -0.370
    ==============================================================================
    

    We make one of the binary variable having only 1 positive:

    x['b1'] = (x['c1']>0.99).astype(int)
    
    x.b1.sum()
    2
    
    model= sm.GLM(y, sm.add_constant(x), max_iter=500, 
    random_state=42, family=sm.families.Binomial(),
    freq_weights=wt)
    
    results = model.fit()
    

    I get the huge standard error:

                     Generalized Linear Model Regression Results                  
    ==============================================================================
    Dep. Variable:                      y   No. Observations:                  100
    Model:                            GLM   Df Residuals:                     1039
    Model Family:                Binomial   Df Model:                            4
    Link Function:                  logit   Scale:                          1.0000
    Method:                          IRLS   Log-Likelihood:                -675.80
    Date:                Sat, 24 Apr 2021   Deviance:                       1351.6
    Time:                        11:39:50   Pearson chi2:                 1.02e+03
    No. Iterations:                    22                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          z      P>|z|      [0.025      0.975]
    ------------------------------------------------------------------------------
    const         -0.2151      0.128     -1.685      0.092      -0.465       0.035
    b1            23.6170   1.73e+04      0.001      0.999    -3.4e+04     3.4e+04
    b2             0.6002      0.138      4.358      0.000       0.330       0.870
    c1            -0.7671      0.220     -3.482      0.000      -1.199      -0.335
    c2            -0.4366      0.072     -6.037      0.000      -0.578      -0.295
    ==============================================================================
    

    I would check whether it's worthwhile to include that variable