Tags: python, machine-learning, scikit-learn, logistic-regression, patsy

python logistic regression (beginner)


I'm working on teaching myself a bit of logistic regression using Python. I'm trying to apply the lessons in the walkthrough here to the small dataset in the Wikipedia entry here.

Something doesn't seem quite right. Wikipedia and Excel Solver (verified using the method in this video) give intercept -4.0777 and coefficient 1.5046, but the code I built out from the GitHub example outputs -0.924200 and 0.756024, respectively.

The code I'm attempting to use is below. Any glaring mistakes?

import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression


X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
     3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

zipped = list(zip(X,y))
df = pd.DataFrame(zipped,columns = ['study_hrs','p_or_f'])

y, X = dmatrices('p_or_f ~ study_hrs',
                  df, return_type="dataframe")

y = np.ravel(y)

model = LogisticRegression()
model = model.fit(X,y)
print(pd.DataFrame(np.transpose(model.coef_),X.columns))

>>>
                  0
Intercept -0.924200
study_hrs  0.756024

Solution


    Just change the model creation line to

    model = LogisticRegression(C=100000, fit_intercept=False)
    

    Analysis of the problem

    By default, sklearn fits a regularized LogisticRegression with C=1. C is the inverse of the regularization strength: small C means strong regularization, large C means weak regularization.

    This class implements regularized logistic regression using the liblinear library, newton-cg and lbfgs solvers. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).
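
    To see the effect of C concretely, here is a minimal sketch (my own illustration, not part of the original answer) that refits the same data for a few values of C; the estimates approach the unregularized maximum-likelihood values as C grows:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    hours = np.array([0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
                      3.5,4.0,4.25,4.5,4.75,5.0,5.5]).reshape(-1, 1)
    passed = np.array([0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1])

    # larger C = weaker regularization = closer to the plain MLE
    for C in (0.01, 1.0, 1000000.0):
        m = LogisticRegression(C=C).fit(hours, passed)
        print(C, m.intercept_, m.coef_.ravel())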

    Thus, to obtain their model, you should fit

    model = LogisticRegression(C=1000000)
    

    which gives

    Intercept -2.038853 # this is actually half the intercept
    study_hrs  1.504643 # this is correct
    

    Furthermore, the problem also lies in the way you work with the data in patsy; see this simplified, correct example:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    
    X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
         3.5,4.0,4.25,4.5,4.75,5.0,5.5]
    y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
    
    X = np.array([[x] for x in X])
    y = np.ravel(y)
    
    model = LogisticRegression(C=1000000.)
    model = model.fit(X,y)
    
    print('coef', model.coef_)
    print('intercept', model.intercept_)
    

    gives

    coef [[ 1.50464059]]
    intercept [-4.07769916]
    
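
    As an aside (not from the original answer), the same unregularized numbers can be cross-checked with statsmodels, which fits the plain maximum-likelihood model directly and reproduces Wikipedia's values without the large-C workaround:

    import numpy as np
    import statsmodels.api as sm

    X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
         3.5,4.0,4.25,4.5,4.75,5.0,5.5]
    y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

    # add_constant prepends the intercept column; Logit fits the unpenalized MLE
    result = sm.Logit(np.array(y), sm.add_constant(np.array(X))).fit()
    print(result.params)  # ~ [-4.0777, 1.5046]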

    What is the problem exactly? When you call dmatrices, by default it prepends a column of ones (the intercept, or bias) to your input data:

    import pandas as pd
    from patsy import dmatrices

    X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
         3.5,4.0,4.25,4.5,4.75,5.0,5.5]
    y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

    zipped = list(zip(X,y))
    df = pd.DataFrame(zipped, columns=['study_hrs','p_or_f'])

    y, X = dmatrices('p_or_f ~ study_hrs',
                     df, return_type="dataframe")
    
    print(X)
    

    which leads to

        Intercept  study_hrs
    0           1       0.50
    1           1       0.75
    2           1       1.00
    3           1       1.25
    4           1       1.50
    5           1       1.75
    6           1       1.75
    7           1       2.00
    8           1       2.25
    9           1       2.50
    10          1       2.75
    11          1       3.00
    12          1       3.25
    13          1       3.50
    14          1       4.00
    15          1       4.25
    16          1       4.50
    17          1       4.75
    18          1       5.00
    19          1       5.50
    

    and this is why the resulting bias is just half of the true one: scikit-learn also adds its own column of ones, so the model ends up with two identical bias features, and the optimal regularized solution gives each of them half of the weight that a single bias term would receive.
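
    You can verify this splitting directly. The following sketch (my own illustration, not from the original answer) hands sklearn two identical columns of ones, and each one receives roughly half of the true intercept:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    hours = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
             3.5,4.0,4.25,4.5,4.75,5.0,5.5]
    passed = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

    # two identical bias columns, mimicking patsy's intercept plus sklearn's own
    X = np.array([[1.0, 1.0, h] for h in hours])

    model = LogisticRegression(C=1000000., fit_intercept=False).fit(X, passed)
    print(model.coef_)  # both bias weights come out ~ -2.04, half of -4.08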

    So what can you do?

    • do not use patsy in such a way (as in the simplified example above)
    • forbid patsy to add a bias (see the sketch at the end of this answer)
    • tell sklearn not to add a bias itself (as in the example below)


    import numpy as np
    import pandas as pd
    from patsy import dmatrices
    from sklearn.linear_model import LogisticRegression
    
    X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
         3.5,4.0,4.25,4.5,4.75,5.0,5.5]
    y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
    
    zipped = list(zip(X,y))
    df = pd.DataFrame(zipped,columns = ['study_hrs','p_or_f'])
    
    y, X = dmatrices('p_or_f ~ study_hrs',
                      df, return_type="dataframe")
    
    y = np.ravel(y)
    
    model = LogisticRegression(C=100000, fit_intercept=False)
    model = model.fit(X,y)
    print(pd.DataFrame(np.transpose(model.coef_),X.columns))
    

    gives

    Intercept -4.077571
    study_hrs  1.504597
    

    as desired.
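
    For completeness, the second option (forbid patsy to add a bias) would look roughly like the sketch below; '- 1' is standard patsy formula syntax for dropping the intercept column, so sklearn's own fit_intercept (the default) supplies the only bias term:

    import numpy as np
    import pandas as pd
    from patsy import dmatrices
    from sklearn.linear_model import LogisticRegression

    X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
         3.5,4.0,4.25,4.5,4.75,5.0,5.5]
    y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

    df = pd.DataFrame(list(zip(X,y)), columns=['study_hrs','p_or_f'])

    # '- 1' removes patsy's intercept column from the design matrix
    y, X = dmatrices('p_or_f ~ study_hrs - 1', df, return_type="dataframe")

    model = LogisticRegression(C=1000000.).fit(X, np.ravel(y))
    print('coef', model.coef_)            # ~ 1.5046
    print('intercept', model.intercept_)  # ~ -4.0777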