Search code examples
pythonrscikit-learnlogistic-regression

Comparison of R, statmodels, sklearn for a classification task with logistic regression


I have made some experiments with logistic regression in R, python statmodels and sklearn. While the results given by R and statmodels agree, there is some discrepency with what is returned by sklearn. I would like to understand why these results are different. I understand that it is probably not the same optimization algorithms that are used under the wood.

Specifically, I use the standard Default dataset (used in the ISL book). The following Python code reads the data into a dataframe Default.

import pandas as pd
 # data is available here
Default = pd.read_csv('https://d1pqsl2386xqi9.cloudfront.net/notebooks/Default.csv', index_col=0)
 #
Default['default']=Default['default'].map({'No':0, 'Yes':1})
Default['student']=Default['student'].map({'No':0, 'Yes':1})
 #
I=Default['default']==0
print("Number of 'default' values :", Default[~I]['balance'].count())

Number of 'default' values : 333.

There is a total of 10000 examples, with only 333 positives

Logistic regression in R

I use the following

library("ISLR")
data(Default,package='ISLR')
 #write.csv(Default,"default.csv")
glm.out=glm('default~balance+income+student', family=binomial, data=Default)
s=summary(glm.out)
print(s)
#
glm.probs=predict(glm.out,type="response") 
glm.probs[1:5]
glm.pred=ifelse(glm.probs>0.5,"Yes","No")
 #attach(Default)
t=table(glm.pred,Default$default)
print(t)
score=mean(glm.pred==Default$default)
print(paste("score",score))

The result is as follows

Call: glm(formula = "default~balance+income+student", family = binomial, data = Default)

Deviance Residuals: Min 1Q Median 3Q Max
-2.4691 -0.1418 -0.0557 -0.0203 3.7383

Coefficients:

Estimate Std. Error z value Pr(>|z|)        
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 
balance      5.737e-03    2.319e-04  24.738  < 2e-16  
income       3.033e-06  8.203e-06   0.370   0.71152     
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619  

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2920.6  on 9999  degrees of freedom Residual 

deviance: 1571.5 on 9996 degrees of freedom AIC: 1579.5

Number of Fisher Scoring iterations: 8

     glm.pred   No  Yes
 No  9627  228
 Yes   40  105 

1 "score 0.9732"

I am too lazy to cut and paste the results obtained with statmodels. It suffice to say that they are extremely similar to those given by R.

sklearn

For sklearn, I ran the following code.

  • There is a parameter class_weight for taking into account unbalanced classes. I tested class_weight=None (no weightening -- I think that is the default in R), and class_weight='auto' (weightening with the inverse frequencies found inthe data)
  • I also put C=10000,the inverse of the regularization parameter, so as to minimize the effect of regularization.

~~

import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

features = Default[[ 'balance', 'income' ]]
target = Default['default']
# 
for weight in (None,  'auto'):
    print("*"*40+"\nweight:",weight)

    classifier = LogisticRegression(C=10000, class_weight=weight, random_state=42) 
                #C=10000 ~ no regularization

    classifier.fit(features, target,)  #fit classifier on whole base
    print("Intercept", classifier.intercept_)
    print("Coefficients", classifier.coef_)

    y_true=target
    y_pred_cls=classifier.predict_proba(features)[:,1]>0.5
    C=confusion_matrix(y_true,y_pred_cls)

    score=(C[0,0]+C[1,1])/(C[0,0]+C[1,1]+C[0,1]+C[1,0])
    precision=(C[1,1])/(C[1,1]+C[0 ,1])
    recall=(C[1,1])/(C[1,1]+C[1,0])
    print("\n Confusion matrix")
    print(C)
    print()
    print('{s:{c}<{n}}{num:2.4}'.format(s='Score',n=15,c='', num=score))
    print('{s:{c}<{n}}{num:2.4}'.format(s='Precision',n=15,c='', num=precision))
    print('{s:{c}<{n}}{num:2.4}'.format(s='Recall',n=15,c='', num=recall))

The results are given below.

> **************************************** 
>weight: None 
>
>Intercept [ -1.94164126e-06] 
>
>Coefficients [[ 0.00040756 -0.00012588]]
> 
>  Confusion matrix 
>
>     [[9664    3]  
>     [ 333    0]]
> 
>     Score          0.9664 
>     Precision      0.0 
>     Recall         0.0
>
> **************************************** 
>weight: auto 
>
>Intercept [-8.15376429] 
>
>Coefficients 
>[[  5.67564834e-03   1.95253338e-05]]
> 
>  Confusion matrix 
>
>     [[8356 1311]  
>     [  34  299]]
> 
>     Score          0.8655 
>     Precision      0.1857 
>     Recall         0.8979

What I observe is that for class_weight=None, the Score is excellent but no positive example is recognized. Precision and recall are at zero. The coefficients found are very small, particularly the intercept. Modifying C does not change things. For class_weight='auto' things seems better but I still have a precision which is very low (too much positive classified). Again, changing C does not help. If I modify the intercept by hand, I can recover the results given by R. So I suspect that here is a discrepency between the estimation of the intecepts in the two cases. As this has a consequence in the specification of the threeshold (analog to a resampling of pulations), this can explain the differences in performances.

However, I would welcome any advice for the choice between the two solutions and help to understand the origin of these differences. Thanks.


Solution

  • I ran into a similar issue and ended up posting about it on /r/MachineLearning. It turns out the difference can be attributed to data standardization. Whatever approach scikit-learn is using to find the parameters of the model will yield better results if the data is standardized. scikit-learn has some documentation discussing preprocessing data (including standardization), which can be found here.

    Results

    Number of 'default' values : 333
    Intercept: [-6.12556565]
    Coefficients: [[ 2.73145133  0.27750788]]
    
    Confusion matrix
    [[9629   38]
     [ 225  108]]
    
    Score          0.9737
    Precision      0.7397
    Recall         0.3243
    

    Code

    # scikit-learn vs. R
    # http://stackoverflow.com/questions/28747019/comparison-of-r-statmodels-sklearn-for-a-classification-task-with-logistic-reg
    
    import pandas as pd
    import sklearn
    
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn import preprocessing
    
    # Data is available here.
    Default = pd.read_csv('https://d1pqsl2386xqi9.cloudfront.net/notebooks/Default.csv', index_col = 0)
    
    Default['default'] = Default['default'].map({'No':0, 'Yes':1})
    Default['student'] = Default['student'].map({'No':0, 'Yes':1})
    
    I = Default['default'] == 0
    print("Number of 'default' values : {0}".format(Default[~I]['balance'].count()))
    
    feats = ['balance', 'income']
    
    Default[feats] = preprocessing.scale(Default[feats])
    
    # C = 1e6 ~ no regularization.
    classifier = LogisticRegression(C = 1e6, random_state = 42) 
    
    classifier.fit(Default[feats], Default['default'])  #fit classifier on whole base
    print("Intercept: {0}".format(classifier.intercept_))
    print("Coefficients: {0}".format(classifier.coef_))
    
    y_true = Default['default']
    y_pred_cls = classifier.predict_proba(Default[feats])[:,1] > 0.5
    
    confusion = confusion_matrix(y_true, y_pred_cls)
    score = float((confusion[0, 0] + confusion[1, 1])) / float((confusion[0, 0] + confusion[1, 1] + confusion[0, 1] + confusion[1, 0]))
    precision = float((confusion[1, 1])) / float((confusion[1, 1] + confusion[0, 1]))
    recall = float((confusion[1, 1])) / float((confusion[1, 1] + confusion[1, 0]))
    print("\nConfusion matrix")
    print(confusion)
    print('\n{s:{c}<{n}}{num:2.4}'.format(s = 'Score', n = 15, c = '', num = score))
    print('{s:{c}<{n}}{num:2.4}'.format(s = 'Precision', n = 15, c = '', num = precision))
    print('{s:{c}<{n}}{num:2.4}'.format(s = 'Recall', n = 15, c = '', num = recall))