r, glm, logistic-regression

How come I get this logistic regression error in glm/glm2 if I don't exhibit linear separation in my data?


I started running into the error (converted from warning):

glm.fit (or glm.fit2): fitted probabilities numerically 0 or 1 occurred

I found this link referencing linear separation of data:

[R] glm.fit: "fitted probabilities numerically 0 or 1 occurr

So I hunted through the data and found a small reproducible example, taken from a small subset of the data, where I don't see any linear separation and yet still get the error with both glm and glm2:

response = c(0,1,0,1,0,0,0,0,0,0)
dependent = c(133,571,1401,4930,3134075,44357054,1718619387,1884020779,8970035092,9392823637)
foo = data.frame(y=response,x=dependent)
glm(y ~ x, family=binomial, data=foo)
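
The claim that there is no separation can be checked directly when there is only one numeric predictor: the two classes are completely or quasi-completely separated only if their ranges of x do not overlap beyond a shared endpoint. The sketch below illustrates that check; `is_separated_1d` is a hypothetical helper written for this post, not a function from base R or glm2.

```r
# Data copied from the example above.
response  <- c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0)
dependent <- c(133, 571, 1401, 4930, 3134075, 44357054, 1718619387,
               1884020779, 8970035092, 9392823637)

# Hypothetical helper (not part of base R or glm2).
# With a single numeric predictor and an intercept, complete or
# quasi-complete separation requires that all x values of one class lie
# at or below all x values of the other class.
is_separated_1d <- function(y, x) {
  r0 <- range(x[y == 0])  # min and max of x where y = 0
  r1 <- range(x[y == 1])  # min and max of x where y = 1
  r0[2] <= r1[1] || r1[2] <= r0[1]
}

range(dependent[response == 1])  # x values observed with y = 1
range(dependent[response == 0])  # x values observed with y = 0
is_separated_1d(response, dependent)
```

Here the y = 1 observations (571 and 4930) sit inside the range of the y = 0 observations (133 up to 9392823637), so the ranges overlap and the helper reports no separation.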

I can avoid the issue by transforming the predictor (called `dependent` in the code above) via log(x+1). However, this transformation is monotonic and doesn't alter the ordering, so I'm not sure why it helps or whether I should be doing it. The values are "microseconds since the last time some event happened", which is why some of them can be large. I tried turning the predictor into a two-level factor (recent, not recent), but that loses information and underperforms the raw values.
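
For reference, a minimal sketch of the log(x+1) workaround described above, using base R's `log1p()` and a calling handler that records whether `glm()` raised any warning. The names `foo_log`, `m_log`, and `warned` are chosen for illustration only. One thing worth keeping in mind when reading the result: the model is linear in x on the logit scale, so a monotonic transformation keeps the ordering of the observations but still changes the shape of the curve being fitted.

```r
# Data copied from the example above.
response  <- c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0)
dependent <- c(133, 571, 1401, 4930, 3134075, 44357054, 1718619387,
               1884020779, 8970035092, 9392823637)

# log1p(x) computes log(1 + x).
foo_log <- data.frame(y = response, x = log1p(dependent))

# Fit the model and record whether any warning was raised along the way.
warned <- FALSE
m_log <- withCallingHandlers(
  glm(y ~ x, family = binomial, data = foo_log),
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)

warned                 # was a warning raised during the fit?
range(fitted(m_log))   # smallest and largest fitted probability
```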


Solution

  • I think this is just a feature of the data combined with the rounding in the floating-point calculations performed while maximizing the likelihood.

    Take a look at the fitted values of the log-transformed set:

    > response = c(0,1,0,1,0,0,0,0,0,0)
    > dependent = c(133,571,1401,4930,3134075,44357054,1718619387,1884020779,8970035092,9392823637)
    > 
    > foo = data.frame(y=response,x=log(dependent))
    > mlog <- glm(y ~ x, family=binomial, data=foo)
    > mlog$fitted
              1           2           3           4 
    0.584089292 0.484155299 0.422713978 0.340825478 
              5           6           7           8 
    0.079815887 0.040011202 0.014931996 0.014562755 
              9          10 
    0.009506656 0.009387457 
    

    Whereas the untransformed set results in minuscule fitted values:

    > foo = data.frame(y=response,x=dependent)
    > m <- glm(y ~ x, family=binomial, data=foo)
    Warning message:
    glm.fit: fitted probabilities numerically 0 or 1 occurred 
    > m$fitted.values
               1            2            3 
    5.007959e-01 5.005387e-01 5.000511e-01 
               4            5            6 
    4.979784e-01 6.359085e-04 2.220446e-16 
               7            8            9 
    2.220446e-16 2.220446e-16 2.220446e-16 
              10 
    2.220446e-16 
    

    This doesn't seem to be a warning related to complete (or quasi-complete) separation of the data. I think the warning is pretty informative in this case: some of the fitted probabilities really are numerically indistinguishable from 0.
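
To see which observations trigger the warning, the fitted values can be compared against a small tolerance by hand. The sketch below assumes, based on my reading of the `glm.fit` source, that the binomial check uses `10 * .Machine$double.eps` as that tolerance; treat the exact threshold as an assumption rather than documented behaviour. It also prints the value that the logit inverse link returns for a very negative linear predictor, which can be compared with the 2.220446e-16 entries in the output above.

```r
# Data copied from the question.
response  <- c(0, 1, 0, 1, 0, 0, 0, 0, 0, 0)
dependent <- c(133, 571, 1401, 4930, 3134075, 44357054, 1718619387,
               1884020779, 8970035092, 9392823637)
foo <- data.frame(y = response, x = dependent)

# Hide the glm.fit warning so the check can be repeated by hand below.
m <- suppressWarnings(glm(y ~ x, family = binomial, data = foo))

# Assumed tolerance: the value glm.fit appears to use for binomial fits.
eps <- 10 * .Machine$double.eps

# Indices of observations whose fitted probability is numerically 0 or 1.
flagged <- which(fitted(m) < eps | fitted(m) > 1 - eps)
flagged

# The logit inverse link stays strictly positive even for a very negative
# linear predictor; compare its value with machine epsilon.
binomial()$linkinv(-1e6)
.Machine$double.eps
```

Given the fitted values printed above, the flagged observations should be the ones with the largest x, where the fitted probability has bottomed out.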