Tags: python, pandas, scikit-learn, statsmodels

LogisticRegression from scikit-learn and smf.logit() from statsmodels.formula.api return different results


I am trying to calculate the variance of the coefficients for logistic regression using the bootstrap, and I am using scikit-learn and statsmodels to compare the results. I am using the Default dataset from the ISLR website, which can be found in the zip folder here or here as a plain csv file. I use the following code to perform the bootstrap:

Import the Dataset and create the response variable

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression

default_df = pd.read_csv("./Machine-Learning-Books-With-Python/Introduction to Statistical Learning/data/default.csv")

default_df['default_01'] = np.where(default_df.default == 'Yes', 1, 0)

Next, I am defining the boot function which will take care of the random sampling for my dataset:

def boot(X, bootSample_size=None):
    '''
    Sampling observations from a dataframe

    Parameters
    ------------
    X : pandas dataframe
        Data to be resampled

    bootSample_size: int, optional
        Dimension of the bootstrapped samples

    Returns
    ------------
    bootSample_X : pandas dataframe
                   Resampled data

    Examples
    ----------
    To resample data from the X dataframe:
        >>> boot(X)
    The resampled data will have length equal to len(X).

    To resample data from the X dataframe in order to have length 5:
        >>> boot(X, 5)

    References
    ------------
    http://nbviewer.jupyter.org/gist/aflaxman/6871948

    '''
    # assign default size if not specified
    if bootSample_size is None:
        bootSample_size = len(X)

    # draw random integer indices into the original data for the bootstrap sample
    bootSample_i = (np.random.rand(bootSample_size) * len(X)).astype(int)
    bootSample_X = X.iloc[bootSample_i]

    return bootSample_X
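
An equivalent, shorter way to draw the same kind of bootstrap sample would be pandas' built-in sampling with replacement (just a sketch of an alternative, not what I actually use below):

def boot_alt(X, bootSample_size=None):
    # Equivalent bootstrap resample using pandas' DataFrame.sample
    n = len(X) if bootSample_size is None else bootSample_size
    return X.sample(n=n, replace=True)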

Finally, I define two functions that will perform the logistic regression and extract the parameters:

Using statsmodels

def boot_fn(data):
    lr = smf.logit(formula='default_01 ~ income + balance', data=data).fit(disp=0)
    return lr.params

Using scikit-learn

def boot_fn2(data):
    X = data[['income', 'balance']]
    y = data.default_01
    logit = LogisticRegression(C = 1e9)
    logit.fit(X, y)
    return logit.coef_

Finally, here is the loop to run the functions 100 times and store the results:

coef_sk = []
coef_sm = []
for _ in np.arange(100):
    data = boot(default_df)
    coef_sk.append(boot_fn2(data))
    coef_sm.append(boot_fn(data))

Taking the mean of both coef_sk (scikit-learn) and coef_sm (statsmodels), I see that the statsmodels estimates are much closer to the true values, and across different runs the scikit-learn coefficients diverge quite a lot from the actual values. Could you please explain why this happens? I find it confusing, since for the same data I would expect the results to be the same, or at most marginally different. In this case, however, they differ a lot, which leads me to believe that something is wrong with the way I am running the scikit-learn version. I would appreciate any help and would be happy to provide additional clarification.
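
For reference, this is roughly how I compare the two sets of estimates (a minimal sketch; the statsmodels intercept is dropped so both arrays contain only the income and balance coefficients):

# Average bootstrap estimates from each library (statsmodels params start with the intercept)
mean_sm = np.array(coef_sm)[:, 1:].mean(axis=0)     # income, balance (statsmodels)
mean_sk = np.array(coef_sk).mean(axis=0).ravel()    # income, balance (scikit-learn)
print(mean_sm, mean_sk)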


Solution

  • Although you set the C parameter to a high value to minimize the effect of regularization, sklearn by default uses the lbfgs solver to find the optimal parameters, while statsmodels uses Newton's method.

    You can try doing this to get similar coefficients:

    def boot_fn2(data):
        X = data[['income', 'balance']]
        y = data.default_01
        logit = LogisticRegression(penalty="none", max_iter=1000, solver="newton-cg")
        logit.fit(X, y)
        return logit.coef_
    

    If I run this with the above function:

    coef_sk = []
    coef_sm = []
    for _ in np.arange(50):
        data = boot(default_df)
        coef_sk.append(boot_fn2(data))
        coef_sm.append(boot_fn(data))
    

    You will instantly see that it throws a lot of warnings about being unable to converge:

    LineSearchWarning: The line search algorithm did not converge
    

    Although the coefficients are similar now, this points to a larger issue with your dataset, similar to this question.

    np.array(coef_sm)[:,1:].mean(axis=0)
    array([2.14570133e-05, 5.68280785e-03])
    
    np.array(coef_sk).mean(axis=0)
    array([[2.14352318e-05, 5.68116402e-03]])
    

    Your predictor variables (income and balance) are quite large in magnitude, and this poses a problem for the optimization methods available in sklearn. You can simply scale these two predictors down if you want to interpret the coefficients:

    default_df[['balance','income']] = default_df[['balance','income']]/100
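
    To see why this keeps the coefficients interpretable: dividing a predictor by 100 simply multiplies its fitted coefficient by 100, and the model's predictions are unchanged. A quick illustrative sketch (my addition, using statsmodels on a non-mutating copy, assuming default_df still holds the original, unscaled data):

    # Illustrative sketch: refit on a copy whose predictors are divided by 100 and
    # compare with the original-scale fit; predictions are identical, coefficients rescale.
    lr_raw = smf.logit('default_01 ~ income + balance', data=default_df).fit(disp=0)
    scaled_df = default_df.assign(income=default_df.income / 100,
                                  balance=default_df.balance / 100)
    lr_scaled = smf.logit('default_01 ~ income + balance', data=scaled_df).fit(disp=0)
    # lr_scaled.params['balance'] is roughly 100 * lr_raw.params['balance']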
    

    Otherwise, it is always good practice to standardize your independent variables first and then fit the regression:

    from sklearn.preprocessing import StandardScaler
    default_df[['balance','income']] = StandardScaler().fit_transform(default_df[['balance','income']])
    
    def boot_fn(data):
        lr = smf.logit(formula='default_01 ~ income + balance', data=data).fit(disp=0)
        return lr.params
    def boot_fn2(data):
        X = data[['income', 'balance']]
        y = data.default_01
        logit = LogisticRegression(penalty="none")
        logit.fit(X, y)
        return logit.coef_
    
    coef_sk = []
    coef_sm = []
    for _ in np.arange(50):
        data = boot(default_df)
        #print(data.default_01.mean())
        coef_sk.append(boot_fn2(data))
        coef_sm.append(boot_fn(data))
    

    Now you'll see the coefficients are similar:

    np.array(coef_sm)[:,1:].mean(axis=0)    
    array([0.26517582, 2.71598194])
    
    np.array(coef_sk).mean(axis=0)
    array([[0.26517504, 2.71598548]])
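
    If you still want coefficients in the original units after standardizing, you can divide each standardized coefficient by the corresponding column's standard deviation. A rough sketch of this back-transformation (my addition; default_df_raw stands for a hypothetical copy of the data saved before applying StandardScaler):

    # Sketch: map the standardized coefficients back to the original predictor units.
    # default_df_raw is an assumed copy of the data made before StandardScaler was applied.
    stds = default_df_raw[['income', 'balance']].std(ddof=0)   # StandardScaler uses ddof=0
    beta_std = np.array(coef_sk).mean(axis=0).ravel()          # [income, balance], standardized
    beta_original = beta_std / stds.values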