Tags: python, machine-learning, scikit-learn, logistic-regression, patsy

python logistic regression (beginner)


I'm working on teaching myself a bit of logistic regression using Python. I'm trying to apply the lessons in the walkthrough here to the small dataset in the Wikipedia entry here.

Something doesn't seem quite right. Wikipedia and Excel Solver (verified using the method in this video) give intercept -4.0777 and coefficient 1.5046, but the code I built out from the GitHub example outputs -0.924200 and 0.756024, respectively.

The code I'm attempting to use is below. Any glaring mistakes?

import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression


X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
     3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

zipped = list(zip(X,y))
df = pd.DataFrame(zipped,columns = ['study_hrs','p_or_f'])

y, X = dmatrices('p_or_f ~ study_hrs',
                  df, return_type="dataframe")

y = np.ravel(y)

model = LogisticRegression()
model = model.fit(X,y)
print(pd.DataFrame(np.transpose(model.coef_),X.columns))

>>>
                  0
Intercept -0.924200
study_hrs  0.756024

Solution


    Just change the model creation line to

    model = LogisticRegression(C=100000, fit_intercept=False)
    

    Analysis of the problem

    By default, sklearn fits a regularized LogisticRegression with C=1. C is the inverse of the regularization strength: small C means strong regularization, large C means weak regularization.

    This class implements regularized logistic regression using the liblinear library, newton-cg and lbfgs solvers. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).
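
    To see the effect of C concretely, here is a minimal sketch (my own illustration, not part of the original answer) that refits the same data for a few values of C; the estimates approach the unregularized maximum-likelihood values as C grows:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    hours = np.array([0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
                      3.5,4.0,4.25,4.5,4.75,5.0,5.5]).reshape(-1, 1)
    passed = np.array([0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1])

    # larger C = weaker regularization = closer to the plain MLE
    for C in (0.01, 1.0, 1000000.0):
        m = LogisticRegression(C=C).fit(hours, passed)
        print(C, m.intercept_, m.coef_.ravel())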

    Thus, to obtain their model, you should fit

    model = LogisticRegression(C=1000000)
    

    which gives

    Intercept -2.038853 # this is actually half the intercept
    study_hrs  1.504643 # this is correct
    

    Furthermore, the problem also lies in the way you work with the data in patsy; see this simplified, correct example:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    
    X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
         3.5,4.0,4.25,4.5,4.75,5.0,5.5]
    y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
    
    X = np.array([[x] for x in X])
    y = np.ravel(y)
    
    model = LogisticRegression(C=1000000.)
    model = model.fit(X,y)
    
    print('coef', model.coef_)
    print('intercept', model.intercept_)
    

    gives

    coef [[ 1.50464059]]
    intercept [-4.07769916]
    
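
    As an aside (not from the original answer), the same unregularized numbers can be cross-checked with statsmodels, which fits the plain maximum-likelihood model directly and reproduces Wikipedia's values without the large-C workaround:

    import numpy as np
    import statsmodels.api as sm

    X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
         3.5,4.0,4.25,4.5,4.75,5.0,5.5]
    y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

    # add_constant prepends the intercept column; Logit fits the unpenalized MLE
    result = sm.Logit(np.array(y), sm.add_constant(np.array(X))).fit()
    print(result.params)  # ~ [-4.0777, 1.5046]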

    What is the problem exactly? When you call dmatrices, by default it prepends a column of ones (the intercept, or bias) to your input data:

    import pandas as pd
    from patsy import dmatrices

    X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
         3.5,4.0,4.25,4.5,4.75,5.0,5.5]
    y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

    zipped = list(zip(X,y))
    df = pd.DataFrame(zipped, columns=['study_hrs','p_or_f'])

    y, X = dmatrices('p_or_f ~ study_hrs',
                     df, return_type="dataframe")
    
    print(X)
    

    which leads to

        Intercept  study_hrs
    0           1       0.50
    1           1       0.75
    2           1       1.00
    3           1       1.25
    4           1       1.50
    5           1       1.75
    6           1       1.75
    7           1       2.00
    8           1       2.25
    9           1       2.50
    10          1       2.75
    11          1       3.00
    12          1       3.25
    13          1       3.50
    14          1       4.00
    15          1       4.25
    16          1       4.50
    17          1       4.75
    18          1       5.00
    19          1       5.50
    

    and this is why the resulting bias is just half of the true one: scikit-learn also adds its own column of ones, so the model ends up with two identical bias features, and the optimal regularized solution gives each of them half of the weight that a single bias term would receive.
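
    You can verify this splitting directly. The following sketch (my own illustration, not from the original answer) hands sklearn two identical columns of ones, and each one receives roughly half of the true intercept:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    hours = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
             3.5,4.0,4.25,4.5,4.75,5.0,5.5]
    passed = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

    # two identical bias columns, mimicking patsy's intercept plus sklearn's own
    X = np.array([[1.0, 1.0, h] for h in hours])

    model = LogisticRegression(C=1000000., fit_intercept=False).fit(X, passed)
    print(model.coef_)  # both bias weights come out ~ -2.04, half of -4.08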

    So what can you do?

    • do not use patsy in such a way (as in the simplified example above)
    • forbid patsy to add a bias (see the sketch at the end of this answer)
    • tell sklearn not to add a bias itself (as in the example below)


    import numpy as np
    import pandas as pd
    from patsy import dmatrices
    from sklearn.linear_model import LogisticRegression
    
    X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
         3.5,4.0,4.25,4.5,4.75,5.0,5.5]
    y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
    
    zipped = list(zip(X,y))
    df = pd.DataFrame(zipped,columns = ['study_hrs','p_or_f'])
    
    y, X = dmatrices('p_or_f ~ study_hrs',
                      df, return_type="dataframe")
    
    y = np.ravel(y)
    
    model = LogisticRegression(C=100000, fit_intercept=False)
    model = model.fit(X,y)
    print(pd.DataFrame(np.transpose(model.coef_),X.columns))
    

    gives

    Intercept -4.077571
    study_hrs  1.504597
    

    as desired.
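
    For completeness, the second option (forbid patsy to add a bias) would look roughly like the sketch below; '- 1' is standard patsy formula syntax for dropping the intercept column, so sklearn's own fit_intercept (the default) supplies the only bias term:

    import numpy as np
    import pandas as pd
    from patsy import dmatrices
    from sklearn.linear_model import LogisticRegression

    X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
         3.5,4.0,4.25,4.5,4.75,5.0,5.5]
    y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

    df = pd.DataFrame(list(zip(X,y)), columns=['study_hrs','p_or_f'])

    # '- 1' removes patsy's intercept column from the design matrix
    y, X = dmatrices('p_or_f ~ study_hrs - 1', df, return_type="dataframe")

    model = LogisticRegression(C=1000000.).fit(X, np.ravel(y))
    print('coef', model.coef_)            # ~ 1.5046
    print('intercept', model.intercept_)  # ~ -4.0777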