Search code examples
rgraphstatisticsglmmle

Why are the predicted values of my GLM cyclical?


I wrote a binomial regression model to predict the prevalence of igneous stone, v, at an archaeological site based on proximity to a river, river_dist, but when I use the predict() function I'm getting odd cyclical results instead of the curve I was expecting. For reference, my data:

    v   n river_dist
1 102 256       1040
2   1  11        720
3  19  24        475
4  12  15        611

Which I fit to this model:

library(bbmle)
m_r <- mle2(ig$v ~ dbinom(size=ig$n, prob = 1/(1+exp(-(a + br * river_dist)))),
    start = list(a = 0, br = 0), data = ig)

This produces a coefficient which, when back-transformed, suggests about 0.4% decrease in the likelihood of igneous stone per meter from the river (br = 0.996):

exp(coef(m_r))

That's all good. But when I try to predict new values, I get this odd cycling of values:

newdat <- data.frame(river_dist=seq(min(ig$river_dist), max(ig$river_dist),len=100))
newdat$v <- predict(m_r, newdata=newdat, type="response")
plot(v~river_dist, data=ig, col="red4")
lines(v ~ river_dist, newdat, col="green4", lwd=2)

Example of predicted values:

   river_dist          v
1     475.0000 216.855114
2     480.7071   9.285536
3     486.4141  20.187424
4     492.1212  12.571487
5     497.8283 213.762248
6     503.5354   9.150584
7     509.2424  19.888471
8     514.9495  12.381805
9     520.6566 210.476312
10    526.3636   9.007289
11    532.0707  19.571218
12    537.7778  12.180629

Why are the values cycling up and down like that, creating crazy spikes when graphed?


Solution

  • In order for newdata to work, you have to specify the variables as 'raw' values rather than with $:

    library(bbmle)
    m_r <- mle2(v ~ dbinom(size=n, prob = 1/(1+exp(-(a + br * river_dist)))),
        start = list(a = 0, br = 0), data = ig)
    

    At this point, as @user20650 suggests, you'll also have to specify a value (or values) for n in newdata.

    This model appears to be identical to binomial regression: is there a reason not to use

    glm(cbind(v,n-v) ~ river_dist, data=ig, family=binomial) 
    

    ? (bbmle:mle2 is more general, but glm is much more robust.) (Also: fitting two parameters to four data points is theoretically fine, but you should not try to push the results too far ... in particular, a lot of the default results from GLM/MLE are asymptotic ...)

    Actually, in double-checking the correspondence of the MLE fit with GLM I realized that the default method ("BFGS", for historical reasons) doesn't actually give the right answer (!); switching to method="Nelder-Mead" improves things. Adding control=list(parscale=c(a=1,br=0.001)) to the argument list, or scaling the river dist (e.g. going from "1 m" to "100 m" or "1 km" as the unit), would also fix the problem.

    m_r <- mle2(v ~ dbinom(size=n,
            prob = 1/(1+exp(-(a + br * river_dist)))),
                start = list(a = 0, br = 0), data = ig,
                method="Nelder-Mead")
    pframe <- data.frame(river_dist=seq(500,1000,length=51),n=1)
    pframe$prop <- predict(m_r, newdata=pframe, type="response")
    CIs <- lapply(seq(nrow(ig)),
                  function(i) prop.test(ig[i,"v"],ig[i,"n"])$conf.int)
    ig2 <- data.frame(ig,setNames(as.data.frame(do.call(rbind,CIs)),
                  c("lwr","upr")))
    library(ggplot2); theme_set(theme_bw())
    ggplot(ig2,aes(river_dist,v/n))+
        geom_point(aes(size=n)) +
        geom_linerange(aes(ymin=lwr,ymax=upr)) +
        geom_smooth(method="glm",
                    method.args=list(family=binomial),
                  aes(weight=n))+
        geom_line(data=pframe,aes(y=prop),colour="red")
    

    enter image description here

    Finally, note that your third-farthest site is an outlier (although the small sample size means it doesn't hurt much).