Search code examples
rplotlogistic-regressionpredictconfidence-interval

How to plot logistic glm predicted values and confidence interval in R


I have a binomial glm of a presence/absence response variable and a factor variable with 9 levels like this:

data$y<-factor(data$y,levels=c(0,1),labels=c("absent","present"))
table(data$y,data$site_name)

          Andulay Antulang Basak Dauin Poblacion District 1 Guinsuan Kookoo's Nest Lutoban Pier Lutoban South Malatapay Pier
  absent        4        4     1                          0        3             1            5             5              2
  present       2        2     5                          6        3             5            1             1              4

model <- glm(y~site_name,data=data,binomial)

Just skipping the model inference and validation for brevity's sake, how do I plot per site a probability of getting "present" in a boxplot with its confidence interval? What I would like is kind of what is shown in Plot predicted probabilities and confidence intervals in R but I would like to show it with a boxplot, as my regression variable site_name is a factor with 9 levels, not a continuous variable.

I think I can calculate the necessary values as follows (but am not 100% sure about the correctness):

Function to convert the model coefficients back to probabilities of success:

calc_val <- function(x){return(round(1/(1+1/(exp(x))),3))}

Predicted probabilities based on the model:

prob <- tapply(predict(model,type="response"),data$site_name,function(x){round(mean(x),3)})
means <- as.data.frame(prob)

75% and 95% confidence intervals for the predicted probabilities:

ci <- cbind(confint(model,level=0.9),confint(model,level=0.5))
rownames(ci) <- gsub("site_name","",rownames(ci))
ci <- t(apply(ci,1,calc_val))

Join it all together in one table

ci<-cbind(means,ci)
ci
                            prob   5 %  95 %  25 %  75 %   Pr(>|z|) stderr
Andulay                    0.333 0.091 0.663 0.214 0.469 0.42349216  0.192
Antulang                   0.333 0.112 0.888 0.304 0.696 1.00000000  0.192
Basak                      0.833 0.548 0.993 0.802 0.964 0.09916496  0.152
Dauin Poblacion District 1 1.000 0.000    NA 0.000 1.000 0.99097988  0.000
Guinsuan                   0.500 0.223 0.940 0.474 0.819 0.56032414  0.204
Kookoo's Nest              0.833 0.548 0.993 0.802 0.964 0.09916496  0.152
Lutoban Pier               0.167 0.028 0.788 0.130 0.501 0.51171512  0.152
Lutoban South              0.167 0.028 0.788 0.130 0.501 0.51171512  0.152
Malatapay Pier             0.667 0.364 0.972 0.640 0.903 0.25767454  0.192

So my questions are twofold:

  1. Is the calculation of probability and confidence interval correct?
  2. How do I plot this in a bloxplot (box and whiskers plot)?

EDIT Here is some sample data via dput (which also modified the tables above to match the data):

# dput(data[c("y", "site_name")])
data <- structure(list(y = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("absent", "present"), class = "factor"), site_name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 9L, 9L, 9L, 9L, 9L, 9L, 4L, 4L, 4L, 4L, 4L, 4L, 8L, 8L, 8L, 8L, 8L, 8L, 7L, 7L, 7L, 7L, 7L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("Andulay", "Antulang", "Basak", "Dauin Poblacion District 1", "Guinsuan", "Kookoo's Nest", "Lutoban Pier", "Lutoban South", "Malatapay Pier"), class = "factor")), .Names = c("y", "site_name"), row.names = c(125L, 123L, 126L, 124L, 128L, 127L, 154L, 159L, 157L, 158L, 156L, 155L, 111L, 114L, 116L, 115L, 112L, 113L, 152L, 151L, 148L, 150L, 153L, 149L, 143L, 146L, 144L, 147L, 142L, 145L, 164L, 165L, 161L, 163L, 160L, 162L, 120L, 122L, 121L, 117L, 118L, 119L, 137L, 136L, 139L, 141L, 140L, 138L, 129L, 134L, 131L, 135L, 133L, 130L), class = "data.frame")
#

Solution

  • This is a lowest-common-denominator, base-package-only, solution.

    Fit the model:

    mm <- glm(y~site_name,data=dd,family=binomial)
    

    Make up a prediction frame with the site names:

    pframe <- data.frame(site_name=unique(dd$site_name))
    

    Predict (on the logit/linear-predictor scale), with standard errors

    pp <- predict(mm,newdata=pframe,se.fit=TRUE)
    linkinv <- family(mm)$linkinv ## inverse-link function
    

    Put together the prediction, lower and upper bounds, and back-transform to the probability scale:

    pframe$pred0 <- pp$fit
    pframe$pred <- linkinv(pp$fit)
    alpha <- 0.95
    sc <- abs(qnorm((1-alpha)/2))  ## Normal approx. to likelihood
    alpha2 <- 0.5
    sc2 <- abs(qnorm((1-alpha2)/2))  ## Normal approx. to likelihood
    pframe <- transform(pframe,
                        lwr=linkinv(pred0-sc*pp$se.fit),
                        upr=linkinv(pred0+sc*pp$se.fit),
                        lwr2=linkinv(pred0-sc2*pp$se.fit),
                        upr2=linkinv(pred0+sc2*pp$se.fit))
    

    Plot.

    with(pframe,
    {
        plot(site_name,pred,ylim=c(0,1))
        arrows(as.numeric(site_name),lwr,as.numeric(site_name),upr,
               angle=90,code=3,length=0.1)
    })
    

    As a boxplot:

    with(pframe,
    {
        bxp(list(stats=rbind(lwr,lwr2,pred,upr2,upr),
                 n = rep(1,nrow(pframe)),
                 conf = NA,
                 out = NULL,
                 group = NULL,
                 names=as.character(site_name)))
    })
    

    There are lots of other ways to do this; I would recommend

    library("ggplot2")
    ggplot(pframe,aes(site_name,pred))+
         geom_pointrange(aes(ymin=lwr,ymax=upr))+
         geom_linerange(aes(ymin=lwr2,ymax=upr2),lwd=1.5)+
         coord_flip()
    

    An alternative solution is to fit the model via y~site_name-1, which in this case will assign a separate parameter to the probability of each site, and use profile()/confint() to find the confidence intervals; this will be slightly more accurate than relying on the Normality of the sampling distributions of the parameters/predictions as done in the answer above.