Tags: python, scikit-learn, seaborn, logistic-regression

Seaborn Regplot and Scikit-Learn Logistic Models Calculated Differently?


I'm using both the Scikit-Learn and Seaborn logistic regression functions -- the former for extracting model info (i.e., log-odds, parameters, etc.) and the latter for plotting the resulting sigmoid curve fit to the probability estimates.

Maybe my intuition is incorrect for how to interpret this plot, but I don't seem to be getting results as I'd expect:

# Build and visualize a simple logistic regression
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression

ap_X = ap[['TOEFL Score']].values    # feature matrix, shape (n_samples, 1)
ap_y = ap['Chance of Admit'].values  # binary target

ap_lr = LogisticRegression()
ap_lr.fit(ap_X, ap_y)

def ap_log_regplot(ap_X, ap_y):
    plt.figure(figsize=(15, 10))
    sns.regplot(x=ap_X, y=ap_y, logistic=True, color='green')
    return None

ap_log_regplot(ap_X, ap_y)
plt.xlabel('TOEFL Score')
plt.ylabel('Probability')
plt.title('Logistic Regression: Probability of High Chance by TOEFL Score')
plt.show()

[Plot: sigmoid curve fitted by sns.regplot to Chance of Admit vs. TOEFL Score]

Seems alright, but then I attempt to use the predict_proba function in Scikit-Learn to find the probability of Chance of Admit given some arbitrary values for TOEFL Score (in this case 108, 104, and 112):

eight = ap_lr.predict_proba([[108]])[:, 1]   # predict_proba expects a 2D array
four = ap_lr.predict_proba([[104]])[:, 1]
twelve = ap_lr.predict_proba([[112]])[:, 1]
print(eight, four, twelve)

Where I get:

[0.49939019] [0.44665597] [0.55213799]

To me, this seems to indicate that a TOEFL Score of 112 gives an individual a 55% chance of being admitted based on this data set. But if I extend a vertical line from 112 on the x-axis up to the sigmoid curve, I'd expect the intersection at around 0.90.
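As a sanity check, the same probability can be computed by hand from the fitted intercept and coefficient (a minimal sketch, assuming the ap_lr model fit above); since predict_proba is just the sigmoid of the linear predictor, the two values agree exactly:

import numpy as np

# p = 1 / (1 + exp(-(b0 + b1 * x))) -- the logistic (sigmoid) function
b0 = ap_lr.intercept_[0]
b1 = ap_lr.coef_[0][0]

x = 112
print(1 / (1 + np.exp(-(b0 + b1 * x))))  # same value as ap_lr.predict_proba([[112]])[:, 1]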

Am I interpreting/modeling this correctly? I realize that I'm using two different packages to calculate the model coefficients, but with another model on a different data set I seem to get predictions that fit the logistic curve correctly.

Any ideas or am I completely modeling/interpreting this inaccurately?


Solution

  • After some searching, Cross Validated provided the correct answer to my question. Although it already exists on Cross Validated, I wanted to provide this answer on Stack Overflow as well.

    Simply put, Scikit-Learn automatically adds an L2 regularization penalty to the logistic model, which shrinks the coefficients; statsmodels does not. At the time there was apparently no way to turn this off, so one has to set the C parameter in the LogisticRegression instantiation to some arbitrarily high value like C=1e9 (recent scikit-learn versions also accept penalty=None). A sketch of the fix follows the link below.

    After trying this and comparing the Scikit-Learn predict_proba() output to the sigmoid curve produced by regplot (which uses statsmodels for its calculation), the probability estimates align.

    Link to full post: https://stats.stackexchange.com/questions/203740/logistic-regression-scikit-learn-vs-statsmodels
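
    As a minimal sketch of the fix (reusing ap_X and ap_y from the question; penalty=None requires scikit-learn 1.2 or newer):

    # Effectively disable the default L2 penalty with a very large C
    from sklearn.linear_model import LogisticRegression

    ap_lr = LogisticRegression(C=1e9)
    # On scikit-learn >= 1.2, the penalty can be removed outright:
    # ap_lr = LogisticRegression(penalty=None)
    ap_lr.fit(ap_X, ap_y)

    print(ap_lr.predict_proba([[112]])[:, 1])  # now tracks the regplot curve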