
Logistic regression with statsmodels raises an error in Python


I am trying to implement a logistic regression using statsmodels (I need the summary) and I get this error:

LinAlgError: Singular matrix

My df is numeric and its features are correlated; I deleted the non-numeric and constant features. Because of the correlated features, I tried both a regular regression and one with an l1 penalty (l2 isn't available).

I checked the matrix rank and got this output:

print(len(df.columns)) -> 156

print(np.linalg.matrix_rank(df.values)) -> 151

How do I know which features are a problem and why?

My code:

import statsmodels.api as sm

logit = sm.Logit(y, X)
result = logit.fit_regularized(trim_mode='auto', alpha=0, maxiter=150)
print(result.summary())

Update:

After removing the highly correlated features I get:

  len(df.columns) == np.linalg.matrix_rank(df.values)

but I still get the same error, even with a low correlation threshold.

I tried to change the solver as well.


Solution

  • As suggested in the comments, the model won't run if two features are perfectly correlated. If you have a pandas dataframe with a small number of columns, the easiest way to check is to call the .corr() method on it (here, df.corr()) and look for any pair of features whose correlation is exactly 1 or -1.

    It is also worth working out why some features are perfectly correlated in the first place.
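
For a dataframe with more than a handful of columns, scanning df.corr() by eye gets tedious. The check above could be sketched roughly as follows. The helper name perfectly_correlated_pairs, the tol tolerance, and the toy dataframe are all made up for illustration and are not from the question:

```python
import numpy as np
import pandas as pd


def perfectly_correlated_pairs(df, tol=1e-8):
    """Return (col_a, col_b, r) for each pair whose |correlation| is within tol of 1.

    Hypothetical helper for illustration; tol is an assumed tolerance
    to absorb floating-point rounding.
    """
    corr_df = df.corr()
    cols = list(corr_df.columns)
    corr = corr_df.to_numpy()
    pairs = []
    # Walk the upper triangle only, so each pair is reported once
    # and the diagonal (always 1) is skipped.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr[i, j]
            if np.isfinite(r) and abs(abs(r) - 1.0) <= tol:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Toy data (made up): "b" is an exact multiple of "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
pairs = perfectly_correlated_pairs(toy)
print(pairs)
```

Note that this only catches dependence between two columns. A feature that is an exact linear combination of several others can leave every pairwise correlation below 1, so an empty result here does not by itself rule out collinearity.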