There are lots of posts here about the "Perfect Separation Error"
in statsmodels when running a logisitc regression. But I'm not doing logistic regression. I'm doing GLM with frequency weights and gaussian distribution. So basically OLS.
All of my independent variables are categorical with lots of categories. So high dimensional binary coded feature set.
But I'm very frequently getting the "perfectseperationerror" from statsmodels
I'm running many many models. I think I'm getting this error when my data is too thin for that many variables. However, With freq weights, in theory, I actually have many more features then the dataframe holds because the observations should be multiplied by the freq.
Any guidance on how to proceed?
reg = sm.GLM(dep, Indies, freq_weights = freq)
<p>Error: class 'statsmodels.tools.sm_exceptions.PerfectSeparationError'>
The check is on perfect prediction and is used independently of the family.
Currently, there is now workaround when using irls
. Using scipy optimizers, e.g. method="bfgs"
, avoids the perfect prediction/separation check.
https://github.com/statsmodels/statsmodels/issues/2680
Perfect separation is only defined for the binary case, i.e. family binomial in GLM, and could be extended to other discrete models.
However, there can be other problems with inference if the residual variance is zero, i.e. we have a perfect fit.
Here is an issue with perfect prediction in OLS
https://github.com/statsmodels/statsmodels/issues/1459