
Logistic Regression in Python using Logit() and fit()


I am trying to perform logistic regression in Python using the following code:

from patsy import dmatrices
import numpy as np
import pandas as pd
import statsmodels.api as sm

df=pd.read_csv('C:/Users/Documents/titanic.csv')
df=df.drop(['ticket','cabin','name','parch','sibsp','fare'],axis=1) #remove columns from table
df=df.dropna() #dropping null values

formula = 'survival ~ C(pclass) + C(sex) + age' 
df_train = df.iloc[ 0: 6, : ] 
df_test = df.iloc[ 6: , : ]

#splitting data into dependent and independent variables
y_train,x_train = dmatrices(formula, data=df_train,return_type='dataframe')
y_test,x_test = dmatrices(formula, data=df_test,return_type='dataframe')

#instantiate the model
model = sm.Logit(y_train,x_train)
res=model.fit()
res.summary()

I am getting an error at this line:

--->res=model.fit()

I have no missing values in the data set. However, my dataset is very small, with just 10 entries. I am not sure what is going wrong here or how I can fix it. I am running the program in a Jupyter notebook. The whole error message is given below:

    ---------------------------------------------------------------------------
PerfectSeparationError                    Traceback (most recent call last)
<ipython-input-37-c6a47ec170d5> in <module>()
     19 y_test,x_test = dmatrices(formula, data=df_test,return_type='dataframe')
     20 model = sm.Logit(y_train,x_train)
---> 21 res=model.fit()
     22 res.summary()

C:\Program Files\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
   1374         bnryfit = super(Logit, self).fit(start_params=start_params,
   1375                 method=method, maxiter=maxiter, full_output=full_output,
-> 1376                 disp=disp, callback=callback, **kwargs)
   1377 
   1378         discretefit = LogitResults(self, bnryfit)

C:\Program Files\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py in fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
    201         mlefit = super(DiscreteModel, self).fit(start_params=start_params,
    202                 method=method, maxiter=maxiter, full_output=full_output,
--> 203                 disp=disp, callback=callback, **kwargs)
    204 
    205         return mlefit # up to subclasses to wrap results

C:\Program Files\Anaconda3\lib\site-packages\statsmodels\base\model.py in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    423                                                        callback=callback,
    424                                                        retall=retall,
--> 425                                                        full_output=full_output)
    426 
    427         #NOTE: this is for fit_regularized and should be generalized

C:\Program Files\Anaconda3\lib\site-packages\statsmodels\base\optimizer.py in _fit(self, objective, gradient, start_params, fargs, kwargs, hessian, method, maxiter, full_output, disp, callback, retall)
    182                             disp=disp, maxiter=maxiter, callback=callback,
    183                             retall=retall, full_output=full_output,
--> 184                             hess=hessian)
    185 
    186         # this is stupid TODO: just change this to something sane

C:\Program Files\Anaconda3\lib\site-packages\statsmodels\base\optimizer.py in _fit_newton(f, score, start_params, fargs, kwargs, disp, maxiter, callback, retall, full_output, hess, ridge_factor)
    246             history.append(newparams)
    247         if callback is not None:
--> 248             callback(newparams)
    249         iterations += 1
    250     fval = f(newparams, *fargs)  # this is the negative likelihood

C:\Program Files\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py in _check_perfect_pred(self, params, *args)
    184                 np.allclose(fittedvalues - endog, 0)):
    185             msg = "Perfect separation detected, results not available"
--> 186             raise PerfectSeparationError(msg)
    187 
    188     def fit(self, start_params=None, method='newton', maxiter=35,

PerfectSeparationError: Perfect separation detected, results not available

Solution

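  • You have perfect separation, meaning that your data is perfectly separable by a hyperplane: some combination of your predictors predicts the outcome exactly. When this happens, the maximum likelihood estimate for your parameters is not finite, because the likelihood keeps improving as the coefficients grow without bound. That is why statsmodels raises PerfectSeparationError.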

    Example of perfect separation:

    Gender   Outcome  
    male     1
    male     1
    male     0
    female   0
    female   0
    

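    In this case, if I get a female observation, I know with 100% certainty that the outcome will be 0. That is, gender perfectly determines the outcome for that group. (Strictly speaking this is quasi-complete separation, since males show both outcomes, but the effect on the estimate is the same.) The likelihood keeps increasing as the gender coefficient grows in magnitude, so the numerical calculation for finding my coefficients doesn't converge.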

    According to your error, something similar is happening to you. With just 10 entries, of which only the first 6 are used for training (df.iloc[0:6]), this is very likely to happen, far more so than with, say, 1000 entries. So get more data :)
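
To make the gender example above concrete, here is a minimal sketch in plain Python (standard library only, so it does not use statsmodels or reproduce its optimizer). It evaluates the logistic log-likelihood for the five toy rows while the coefficient for "female" is pushed toward minus infinity, and it flags groups that only ever show one outcome. The helper names log_likelihood and single_outcome_groups are hypothetical, chosen just for this illustration.

```python
import math
from collections import defaultdict

# Toy data from the example table above: (gender, outcome)
rows = [("male", 1), ("male", 1), ("male", 0), ("female", 0), ("female", 0)]


def log_likelihood(intercept, beta_female, data):
    """Log-likelihood of the model logit(p) = intercept + beta_female * is_female."""
    total = 0.0
    for gender, outcome in data:
        z = intercept + beta_female * (1.0 if gender == "female" else 0.0)
        if outcome == 1:
            total += -math.log1p(math.exp(-z))  # log(sigmoid(z))
        else:
            total += -math.log1p(math.exp(z))   # log(1 - sigmoid(z))
    return total


def single_outcome_groups(data):
    """Return the groups whose observations all share one outcome."""
    outcomes = defaultdict(set)
    for group, outcome in data:
        outcomes[group].add(outcome)
    return sorted(g for g, seen in outcomes.items() if len(seen) == 1)


# Males have outcome 1 in 2 of 3 rows, so the best intercept is log((2/3) / (1/3)) = log(2).
intercept = math.log(2.0)

# Each more negative value of beta_female fits the data slightly better than the last.
betas = [0.0, -2.0, -5.0, -10.0, -20.0]
likelihoods = [log_likelihood(intercept, b, rows) for b in betas]

for b, ll in zip(betas, likelihoods):
    print(f"beta_female = {b:6.1f}  log-likelihood = {ll:.6f}")

print("groups with a single outcome:", single_outcome_groups(rows))
```

The printed log-likelihood rises with every step and approaches 2*log(2/3) + log(1/3) (about -1.91) without ever reaching it, so no finite coefficient is the best one. Running a check like single_outcome_groups on each categorical column of the training rows is one quick way to spot which predictor is causing the separation.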