Search code examples
rglmemmeans

Unexpected behavior of emmeans with respect to model specification and confidence intervals


My data are integers containing many zeros. I want to model the zeros separately using a binomial generalized linear model. In the model statement I specified Y>0 on the left-hand side of the tilde, which gives me a binary (TRUE, FALSE) vector. I further analyzed the data using the emmeans package specifying (type = "response"). Then I realized (on my actual data) that the confidence intervals seemed off. I tried trouble-shooting this and decided to create a new variable containing the TRUE and FALSE values in my data frame separately. This fixed the problem. Why is this happening?

Below is the code that reproduces this behavior (albeit the effect of this doesn't come through as pronounced as in my original data set):

require(emmeans)
# example data
d <- structure(list(X = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L
), .Label = c("A", "B", "C", "D"), class = "factor"), Y = c(0L, 
4L, 4L, 5L, 6L, 5L, 6L, 7L, 8L, 9L, 0L, 0L, 3L, 4L, 1L, 5L, 2L, 
3L, 2L, 1L, 0L, 0L, 0L, 0L, 0L, 12L, 11L, 6L, 8L, 11L, 0L, 0L, 
0L, 0L, 0L, 12L, 13L, 11L, 12L, 16L)), class = "data.frame", row.names = c(NA, 
-40L))

# add additional variable - set every value > 0 to TRUE, otherwise FALSE
d$no0 <- d$Y>0 

Here is first model using the relational operator >in the model:

# binomial GLM using `Y>0` on the left side
m1 <- glm(Y>0 ~ X, family = binomial(), d)
summary(m1)

Call:
glm(formula = Y > 0 ~ X, family = binomial(), data = d)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1460  -1.1774   0.4590   0.7954   1.1774  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   2.1972     1.0540   2.085   0.0371 *
XB           -0.8109     1.3175  -0.615   0.5382  
XC           -2.1972     1.2292  -1.788   0.0739 .
XD           -2.1972     1.2292  -1.788   0.0739 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 50.446  on 39  degrees of freedom
Residual deviance: 44.236  on 36  degrees of freedom
AIC: 52.236

Number of Fisher Scoring iterations: 4

Here is the second model using the new variable:

# binomial GLM using variable no0
m2 <- glm(no0 ~ X, family = binomial(), d)
summary(m2)

Call:
glm(formula = no0 ~ X, family = binomial(), data = d)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1460  -1.1774   0.4590   0.7954   1.1774  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   2.1972     1.0540   2.085   0.0371 *
XB           -0.8109     1.3175  -0.615   0.5382  
XC           -2.1972     1.2292  -1.788   0.0739 .
XD           -2.1972     1.2292  -1.788   0.0739 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 50.446  on 39  degrees of freedom
Residual deviance: 44.236  on 36  degrees of freedom
AIC: 52.236

Number of Fisher Scoring iterations: 4

So far the outputs are identical. Then I moved on a ran the emmeans() function for model 1 and model 2 without the type = "response" argument:

(em1 <- emmeans(m1, ~ X))
 X emmean    SE  df asymp.LCL asymp.UCL
 A   2.20 1.054 Inf     0.131      4.26
 B   1.39 0.791 Inf    -0.163      2.94
 C   0.00 0.632 Inf    -1.240      1.24
 D   0.00 0.632 Inf    -1.240      1.24

Results are given on the logit (not the response) scale. 
Confidence level used: 0.95 

(em2 <- emmeans(m2, ~ X))
 X emmean    SE  df asymp.LCL asymp.UCL
 A   2.20 1.054 Inf     0.131      4.26
 B   1.39 0.791 Inf    -0.163      2.94
 C   0.00 0.632 Inf    -1.240      1.24
 D   0.00 0.632 Inf    -1.240      1.24

Results are given on the logit (not the response) scale. 
Confidence level used: 0.95 

Again all good. But when I add the type = response argument all looks good except the confidence intervals are different (compare both outputs below):

(em3 <- emmeans(m1, ~ X, type = "response"))
 X response     SE  df asymp.LCL asymp.UCL
 A      0.9 0.0949 Inf     0.714      1.09
 B      0.8 0.1265 Inf     0.552      1.05
 C      0.5 0.1581 Inf     0.190      0.81
 D      0.5 0.1581 Inf     0.190      0.81

Unknown transformation ">": no transformation done 
Confidence level used: 0.95 

(em4 <- emmeans(m2, ~ X, type = "response"))
 X prob     SE  df asymp.LCL asymp.UCL
 A  0.9 0.0949 Inf     0.533     0.986
 B  0.8 0.1265 Inf     0.459     0.950
 C  0.5 0.1581 Inf     0.225     0.775
 D  0.5 0.1581 Inf     0.225     0.775

Confidence level used: 0.95 
Intervals are back-transformed from the logit scale 

I see that there is a warning in the first output (Unknown transformation ">": no transformation done) but why does it affect only the confidence intervals?

Another interesting observation is that when I plot the emmeans objects without the comparisons = T argument in the plot() function it matches the em3 and em4 outputs above with the different confidence intervals:

p1 <- plot(em3, comparisons = F) + scale_x_continuous(limits = c(0,1.1)) + ggtitle("Y>0 ~.; and comparisons = F")
p2 <- plot(em4, comparisons = F) + scale_x_continuous(limits = c(0,1.1)) + ggtitle("no0 ~.; and comparisons = F")
gridExtra::grid.arrange(p1, p2, nrow = 2)

enter image description here

But when I add the comparisons = T argument, the confidence intervals are now the same, however, both match the model that is based on the Y>0 specification in the model (see m3, and em3)

p3 <- plot(em3, comparisons = T) + scale_x_continuous(limits = c(0,1.1)) + ggtitle("Y>0 ~.; and comparisons = T")
p4 <- plot(em4, comparisons = T) + scale_x_continuous(limits = c(0,1.1))+ ggtitle("no0 ~.; and comparisons = T")
gridExtra::grid.arrange(p3, p4, nrow = 2)

enter image description here

This was a bit lengthy but my question boils down to:

Can I use the Y>0 ~ X model specification in combination when using emmeans, or should I first create a separate variable for this?


Solution

  • What is happening is that emmeans allows for situations where there is both a response transformation and a link function. This can be handy, say, when you fit a model with a gamma family, inverse link, and a square root response transformation. However, in this case, the > is taken as a response transformation:

    > emm1 <- emmeans(m1, "X")
    
    > str(emm1)
    'emmGrid' object with variables:
        X = A, B, C, D
    Transformation: “logit” 
    Additional response transformation: “>” 
    

    When you specify type = "response", summary.emmGrid() attempts to undo both transformations -- i.e., tries to put it on the Y scale. You can undo just the link function as follows:

    > confint(emm1, type = "unlink")
     X response     SE  df asymp.LCL asymp.UCL
     A      0.9 0.0949 Inf     0.533     0.986
     B      0.8 0.1265 Inf     0.459     0.950
     C      0.5 0.1581 Inf     0.225     0.775
     D      0.5 0.1581 Inf     0.225     0.775
    
    Confidence level used: 0.95 
    Intervals are back-transformed from the logit scale 
    

    ... or by removing the second transformation:

    > emm1a <- update(emm1, tran2 = NULL)
    > confint(emm1a, type = "response")
     X response     SE  df asymp.LCL asymp.UCL
     A      0.9 0.0949 Inf     0.533     0.986
     B      0.8 0.1265 Inf     0.459     0.950
     C      0.5 0.1581 Inf     0.225     0.775
     D      0.5 0.1581 Inf     0.225     0.775
    
    Confidence level used: 0.95 
    Intervals are back-transformed from the logit scale 
    

    In both cases, the confidence intervals here were computed on the link scale, then back-transformed. The other confidence limits you see here were obtained with these steps reversed, i.e., using the standard errors of the back-transformed results:

    > confint(regrid(emm1, transform = "unlink"))
     X response     SE  df asymp.LCL asymp.UCL
     A      0.9 0.0949 Inf     0.714      1.09
     B      0.8 0.1265 Inf     0.552      1.05
     C      0.5 0.1581 Inf     0.190      0.81
     D      0.5 0.1581 Inf     0.190      0.81
    
    Results are given on the > (not the response) scale. 
    Confidence level used: 0.95 
    

    I will consider if there are changes that can be made that would reliably determine when a response transformation is clearly not intended.