I am trying to implement a logistic regression using statsmodels (I need the summary) and I get this error:
LinAlgError: Singular matrix
My df is numeric and its features are correlated; I have already deleted the non-numeric and constant features. Because of the correlated features, I tried both a regular (unpenalized) regression and one with an L1 penalty (L2 isn't available).
I checked the matrix rank and got this output:
print(len(df.columns)) -> 156
print(np.linalg.matrix_rank(df.values)) -> 151
How do I know which features are a problem and why?
My code:
import statsmodels.api as sm
logit = sm.Logit(y, X)
result = logit.fit_regularized(trim_mode='auto', alpha=0, maxiter=150)
print(result.summary())
Update:
After removing highly correlated features I get:
len(df.columns) == np.linalg.matrix_rank(df.values)
but I still get the same error (even if I set a low correlation threshold).
I also tried changing the solver.
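As suggested in the comments, if two features are perfectly correlated the model won't run: the design matrix is rank-deficient, so the matrix the solver has to invert is singular. The easiest way to check this, if you have a pandas DataFrame with a small number of columns, is to call the .corr() method on it (in this case df.corr()) and check whether any pair of features has a correlation of exactly 1 or -1. Keep in mind that a pairwise check only catches dependencies between two columns; a column that is a linear combination of several others can leave every pairwise correlation below 1 in absolute value and still make the matrix singular.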
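To make the pairwise check concrete, here is a minimal sketch. The helper name `perfectly_correlated_pairs`, the tolerance `tol`, and the toy `demo` frame are all made up for illustration; only `DataFrame.corr()` comes from the suggestion above. A small tolerance is used because floating-point correlations rarely come out as exactly 1.0.

```python
import pandas as pd

def perfectly_correlated_pairs(df, tol=1e-8):
    """Return (col_a, col_b, corr) for pairs whose |correlation| is within tol of 1.

    Hypothetical helper for illustration; tol is an assumed tolerance.
    """
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            # NaN correlations (e.g. constant columns) fail this test and are skipped
            if abs(abs(r) - 1.0) <= tol:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Toy data: b = 2 * a and d = -a, while c is unrelated to the others
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, 0.0, 2.0, 1.0],
    "d": [-1.0, -2.0, -3.0, -4.0],
})

for col_a, col_b, r in perfectly_correlated_pairs(demo):
    print(col_a, col_b, r)
```

On your own data you would call it as `perfectly_correlated_pairs(df)` and then decide which column of each reported pair to drop.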
You should really think about why some features are perfectly correlated, though.
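Since the question reports 156 columns but a rank of 151, one way to see which columns are redundant (including ones that depend on several others at once) is to add columns one at a time and flag those that do not increase the rank. This is only a sketch under assumptions: `dependent_columns` is a hypothetical helper, the result depends on column order, and `np.linalg.matrix_rank` uses a numerical tolerance, so columns on very different scales may need standardizing first.

```python
import numpy as np
import pandas as pd

def dependent_columns(df, tol=None):
    """Greedily keep columns that raise the matrix rank; return (kept, dropped).

    Hypothetical helper for illustration. A column lands in `dropped` when it is
    (numerically) a linear combination of the columns kept before it.
    """
    kept, dropped = [], []
    rank = 0
    for col in df.columns:
        candidate = df[kept + [col]].to_numpy(dtype=float)
        new_rank = np.linalg.matrix_rank(candidate, tol=tol)
        if new_rank > rank:
            kept.append(col)
            rank = new_rank
        else:
            dropped.append(col)
    return kept, dropped

# Toy data: x3 = x1 + x2, so no single pair is perfectly correlated,
# yet the four columns only have rank 3
toy = pd.DataFrame({
    "x1": [1.0, 0.0, 2.0, 3.0, 1.0],
    "x2": [0.0, 1.0, 1.0, 2.0, 4.0],
    "x3": [1.0, 1.0, 3.0, 5.0, 5.0],
    "x4": [2.0, 1.0, 0.0, 1.0, 3.0],
})

kept, dropped = dependent_columns(toy)
print("kept:", kept)
print("dropped:", dropped)
```

The greedy order means the later column of a dependent group is the one reported, so reorder the columns if you would rather keep a different one.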