python seaborn logistic-regression statsmodels

regplot logistic and statsmodels logit. Why different results?


Why do the coefficients (intercept and x) differ between seaborn's logistic regplot visualization and the statsmodels logit() analysis in this code? Shouldn't the two lines at least start at the same intercept? What am I doing wrong?

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.formula.api import logit

np.random.seed(2022)  # to get the same data each time
df = pd.DataFrame({
    'y': np.random.randint(2, size=10),
    'x': np.random.rand(10)
})

mdl = logit("y ~ x", data=df).fit()
print(mdl.summary())
sns.regplot(y='y', x='x', data=df, logistic=True, ci=None)
plt.axline(xy1=(0, mdl.params[0]), slope=mdl.params[1], color='black')
plt.show()

Output

Optimization terminated successfully.
         Current function value: 0.665054
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                   10
Model:                          Logit   Df Residuals:                        8
Method:                           MLE   Df Model:                            1
Date:                Tue, 26 Jul 2022   Pseudo R-squ.:                 0.04053
Time:                        07:43:10   Log-Likelihood:                -6.6505
converged:                       True   LL-Null:                       -6.9315
Covariance Type:            nonrobust   LLR p-value:                    0.4535
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0253      2.902      0.698      0.485      -3.663       7.713
x             -2.7006      3.741     -0.722      0.470     -10.033       4.632
==============================================================================



Solution

  • The sns.regplot() curve shows predicted probabilities, not logits. Your plt.axline() call draws the linear predictor (intercept + slope * x), which is on the log-odds scale, so the two lines are on different scales and cannot be expected to start at the same point. To match the regplot curve using the results of your logit model, you have to compute a probability for each x value from the estimated intercept and slope.

    Probabilities are computed by first computing the logits (the estimated intercept plus the estimated slope times each x value). Indexing the parameters by label avoids relying on their position:

    logits = mdl.params['Intercept'] + mdl.params['x'] * df['x']
    

    and then passing them through the sigmoid function:

    probs = np.exp(logits) / (1 + np.exp(logits))
    

    Here is the full code and plot of both lines:

    import pandas as pd
    import seaborn as sns
    import numpy as np
    import matplotlib.pyplot as plt
    from statsmodels.formula.api import logit
    
    np.random.seed(2022)  # to get the same data each time
    df = pd.DataFrame({
        'y': np.random.randint(2, size=10),
        'x': np.random.rand(10)
    })
    
    mdl = logit("y ~ x", data=df).fit()
    print(mdl.summary())
    
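    # Linear predictor (log-odds), then sigmoid to get probabilities
    logits = mdl.params['Intercept'] + mdl.params['x'] * df['x']
    probs = np.exp(logits) / (1 + np.exp(logits))
    
    sns.regplot(y='y', x='x', data=df, logistic=True, ci=None)
    # Sort by x so the line is drawn left to right instead of jumping between points
    order = np.argsort(df['x'].to_numpy())
    plt.plot(df['x'].to_numpy()[order], probs.to_numpy()[order], color='red')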
    plt.show()
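
As a quick numeric check of why the two lines in the question do not share a starting point, the sketch below plugs the coefficients printed in the summary (2.0253 and -2.7006) into the sigmoid at x = 0 and x = 1. The helper name sigmoid is introduced here for illustration and is not part of seaborn or statsmodels.

```python
import math

# Coefficients copied from the Logit summary in the question
intercept = 2.0253
slope = -2.7006


def sigmoid(z):
    """Map a logit (log-odds) value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Logit and probability at the two ends of the unit interval
p_at_0 = sigmoid(intercept + slope * 0.0)
p_at_1 = sigmoid(intercept + slope * 1.0)

print(f"logit at x=0: {intercept:.4f} -> probability {p_at_0:.3f}")
print(f"logit at x=1: {intercept + slope:.4f} -> probability {p_at_1:.3f}")
```

At x = 0 the logit equals the intercept (about 2.03), while the matching probability is about 0.88. That probability, not the raw intercept, is what belongs on the regplot y axis.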
    

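
The manual sigmoid step can also be cross-checked against the fitted model's own predict() method, which for a Logit model returns probabilities rather than logits. This is a minimal sketch, assuming the same seeded data as in the question; x_grid, manual, and predicted are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

np.random.seed(2022)  # same seed as the question
df = pd.DataFrame({
    'y': np.random.randint(2, size=10),
    'x': np.random.rand(10)
})

# disp=0 silences the optimizer's convergence message
mdl = logit("y ~ x", data=df).fit(disp=0)

# Evenly spaced x values covering the observed range
x_grid = np.linspace(df['x'].min(), df['x'].max(), 50)

# Manual route: linear predictor (logits), then sigmoid
logits = mdl.params['Intercept'] + mdl.params['x'] * x_grid
manual = 1 / (1 + np.exp(-logits))

# Built-in route: predict() applies the formula to new data
# and returns probabilities
predicted = mdl.predict(pd.DataFrame({'x': x_grid}))

print(np.allclose(manual, predicted))
```

If the two routes agree, plotting x_grid against predicted with plt.plot() gives a curve on the same probability scale as the regplot line, without writing the sigmoid by hand.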