Tags: python, regression, probability, logistic-regression

What's the difference between sklearn logistic regressions and seaborn logistic regressions?


I have a system where I'm using the consensus opinion of weighted votes to predict a binary outcome.

Since elections are topical, we can use them as an example. Say I analyze various pollsters over the years and assign each a weighted vote based on how accurate they were. Pollster Skyler ends up with a vote weight of 3, while Pollster Oakley was twice as accurate and ends up with a vote weight of 6. In CGP Grey fashion (a YouTuber who sometimes talks about election mechanics), Skyler predicts that the Tiger candidate will win the open city council seat and Oakley predicts that the SnowLeopard candidate will win it. In this example, the projected winner, based on total weighted votes, would be SnowLeopard with a majority weighted vote percent of 6/9 ≈ 0.66667. The actual outcome of any given election would of course vary, but generally, if the weights are good, we'd expect a candidate's win probability to increase as the majority weighted vote percent increases and, likewise, a majority percent of 50% to indicate a real tossup.
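For concreteness, here's a minimal sketch of that weighted-vote arithmetic (the names and weights are just the hypothetical ones from the example):

# Hypothetical weighted votes: Skyler (weight 3) backs Tiger,
# Oakley (weight 6) backs SnowLeopard
votes = {"Tiger": 3, "SnowLeopard": 6}
total_weight = sum(votes.values())

# The projected winner is the candidate with the most weighted votes;
# the majority weighted vote percent is their share of the total weight
projected_winner = max(votes, key=votes.get)
majority_percent = votes[projected_winner] / total_weight
print(projected_winner, round(majority_percent, 5))  # SnowLeopard 0.66667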

With that example in mind, I'm trying to do a logistic regression on a dataset where the only independent variable is the majority weighted vote percent. I've tried a couple of different methods, and while both generally agree that win probability increases as majority weighted vote percent increases, they disagree about how much, and neither respects the idea that a majority weighted vote percent of 50% indicates a true 50/50 probability.

[Plot: seaborn logistic fit (blue line) vs. sklearn predicted probabilities (green dots)]

The blue line is the logistic regression done through seaborn and the green dots are the logistic regression done through sklearn. Code below. I don't think the underlying mechanics of a logistic regression change from one library to another, so if they're producing different outputs for the same input, my setup must be wrong.

  • Why are these two libraries producing different regressions?
  • How can I force the regression, for either library, to treat a weighted vote majority of 0.5 as a 50% win probability? I can probably just fill in a mass of dummy data to force the conclusion, but I feel like there's got to be a more elegant way.
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression

# Setup a data dict
majorityPercentList = [0.8387096774193549, 0.8387096774193549, 1.0, 1.0, 1.0, 1.0, 0.8387096774193549, 0.8387096774193549, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8387096774193549, 0.8947368421052632, 0.6578947368421053, 1.0, 0.7894736842105263, 1.0, 0.7631578947368421, 0.8947368421052632, 0.8421052631578947, 0.5789473684210527, 1.0, 0.8421052631578947, 0.9210526315789473, 1.0, 0.7894736842105263, 0.6842105263157895, 0.7894736842105263, 0.8, 0.6, 1.0, 1.0, 1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 0.6, 0.8, 0.8, 1.0, 0.6, 1.0, 0.8823529411764706, 0.6470588235294118, 0.9444444444444444, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.6666666666666666, 1.0, 1.0, 0.9444444444444444, 0.7755102040816326, 1.0, 1.0, 0.84, 0.8, 1.0, 0.98, 0.98, 0.84, 0.98, 1.0, 0.98, 1.0, 0.8, 1.0, 0.8082191780821918, 0.9864864864864865, 0.9324324324324325, 0.9054054054054054, 0.9864864864864865, 0.8108108108108109, 0.7837837837837838, 0.972972972972973, 0.9324324324324325, 0.9054054054054054, 0.8918918918918919, 0.8918918918918919, 0.5066666666666667, 0.8666666666666667]
outcomeList = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1]
dataDict = {"majorityPercent": majorityPercentList,
            "isMajorityWinner": outcomeList}

# Setup the dataframe
df = pd.DataFrame(dataDict, columns=['majorityPercent', 'isMajorityWinner'])
x = df[["majorityPercent"]]
y = df['isMajorityWinner']

# Run the sklearn logistic regression
logistic_regression = LogisticRegression()
logistic_regression.fit(x, y)
plt.scatter(x, logistic_regression.predict_proba(x)[:, 1], c="green")

# Run the seaborn version
sns.regplot(x="majorityPercent", y="isMajorityWinner",
            data=df,
            logistic=True,
            ci=None)

# Show the graphs
plt.show()

Solution

  • LogisticRegression in sklearn applies an L2 penalty by default (see its documentation for details), whereas seaborn uses statsmodels to perform the fit, which is not penalized.

    Setting the penalty to none in sklearn would give you the same results (on newer versions of sklearn, pass penalty=None rather than the string 'none'):

    logistic_regression = LogisticRegression(penalty='none')  # penalty=None on newer sklearn
    logistic_regression.fit(x, y)
    plt.scatter(x, logistic_regression.predict_proba(x)[:, 1], c="green")

    sns.regplot(x="majorityPercent", y="isMajorityWinner",
                data=df,
                logistic=True,
                ci=None)
    

    [Plot: with the penalty disabled, the sklearn points (green) fall on the seaborn curve (blue)]
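    To double-check that seaborn's line really is an unpenalized fit, you can compare the coefficients against statsmodels directly (a small sketch; it assumes the df and the unpenalized logistic_regression from above):

    import statsmodels.api as sm

    # statsmodels wants the intercept column added explicitly
    X = sm.add_constant(df['majorityPercent'])
    sm_fit = sm.Logit(df['isMajorityWinner'], X).fit()

    # Both fits should report (nearly) identical intercept and slope
    print(sm_fit.params)
    print(logistic_regression.intercept_, logistic_regression.coef_)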

    If you want to force the predicted probability to be 50% at a vote majority of 0.5, shift the x variable by 0.5 and fit a regression without an intercept, so that at x = 0 the predicted log odds are 0, i.e. the probability is 0.5:

    df = pd.DataFrame(dataDict, columns=['majorityPercent', 'isMajorityWinner'])
    df['majorityPercent_scaled'] = df['majorityPercent'] - 0.5  # center so x = 0 is a 50/50 vote

    logistic_regression = LogisticRegression(penalty='none', fit_intercept=False)
    logistic_regression.fit(df[['majorityPercent_scaled']], df['isMajorityWinner'])
    df.plot.scatter(x='majorityPercent', y='isMajorityWinner')
    plt.scatter(df["majorityPercent"], logistic_regression.predict_proba(df[['majorityPercent_scaled']])[:, 1], c="green")
    

    [Plot: the no-intercept fit passes through probability 0.5 at a 50% vote share]

    There's no way to fit a model without an intercept inside regplot(), so you have to do it yourself with statsmodels, as sketched below.
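    A minimal sketch of that statsmodels fit, reusing the shifted majorityPercent_scaled column from above (the plotting grid is just one way to draw the curve):

    import numpy as np
    import statsmodels.api as sm

    # Omitting add_constant() leaves the intercept out, so the model is
    # log-odds = beta * (majorityPercent - 0.5), and x = 0.5 maps to p = 0.5
    result = sm.Logit(df['isMajorityWinner'], df[['majorityPercent_scaled']]).fit()
    print(result.params)

    # Predicted win probabilities over a smooth grid, plotted on the original scale
    grid = pd.DataFrame({'majorityPercent_scaled': np.linspace(0.0, 0.5, 100)})
    plt.plot(grid['majorityPercent_scaled'] + 0.5, result.predict(grid))
    plt.show()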