Tags: r, statistics, emmeans

How to combine groups in Poisson regression to estimate contrast?


I'm not sure whether this is more of a programming question or a statistical one (i.e., my own lack of understanding).

I have a Poisson mixed model that I want to use to compare average counts across groups at different time periods.

mod <- glmer(Y ~ TX_GROUP * time + (1|ID), data = dat, family = poisson)
mod_em <- emmeans(mod, c("TX_GROUP","time"), type = "response")

 TX_GROUP time     rate        SE  df asymp.LCL asymp.UCL
 0        1    5.743158 0.4566671 Inf  4.914366  6.711723
 1        1    5.529303 0.4639790 Inf  4.690766  6.517741
 0        2    2.444541 0.2981097 Inf  1.924837  3.104564
 1        2    1.467247 0.2307103 Inf  1.078103  1.996855
 0        3    4.570218 0.4121428 Inf  3.829795  5.453790
 1        3    1.676827 0.2472920 Inf  1.255904  2.238826

Now, I want to estimate the marginal count for the combined time period (2 + 3) for each group. Is it not a simple case of exponentiating the sum of the logged counts from:

contrast(mod_em, list(`2 + 3` = c(0, 0, 1, 0, 1, 0)))
contrast(mod_em, list(`2 + 3` = c(0, 0, 0, 1, 0, 1)))

If I try that, the values do not come close to matching the simple means of the combined periods.


Solution

  • First, I suggest that you put both of your contrasts in one list, e.g.,

    contr = list(`2+3|0` = c(0, 0, 1, 0, 1, 0),
                 `2+3|1` = c(0, 0, 0, 1, 0, 1))
    

    You have to decide when you want to back-transform. See the vignette on transformations and note the discussion on "timing is everything". The two basic options are:

    One option: Obtain the marginal means of the log counts, and then back-transform:

    mod_con = update(contrast(mod_em, contr), tran = "log")
    summary(mod_con, type = "response")
    

    [The update call is needed because contrast strips off transformations except in special cases, because it doesn't always know what scale to assign to arbitrary linear functions. For example, the difference of two square roots is not on a square-root scale.]
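    That point can be seen with toy numbers in base R (nothing to do with the model above): squaring the difference of two square roots does not recover the difference of the original values, so there is no scale emmeans could automatically assign to such a contrast.

    ```r
    # Toy illustration: the difference of two square roots is not on a
    # square-root scale, so no automatic back-transformation is possible.
    a <- 9
    b <- 4
    d <- sqrt(a) - sqrt(b)  # 3 - 2 = 1
    d^2                     # 1, which is not a - b = 5
    ```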

    Second option: Back-transform the predictions, then sum them:

    mod_emr = regrid(mod_em)
    contrast(mod_emr, contr)
    

    The distinction between these results is the same as the distinction between a geometric mean (option 1) and an arithmetic mean (option 2). I doubt that either of them will yield the same results as the raw marginal mean counts, because they are based on the predictions from your model. Personally, I think the first option is the better choice, because sums are a linear operation, and the model is linear on the log scale.
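    The geometric-versus-arithmetic distinction is easy to see with toy numbers (hypothetical back-transformed rates, not the estimates from the model above):

    ```r
    x <- c(2.44, 4.57)         # two hypothetical rates on the response scale
    opt1 <- exp(mean(log(x)))  # option 1: average on the log scale, then exp
                               #           (this is the geometric mean)
    opt2 <- mean(x)            # option 2: back-transform first, then average
                               #           (this is the arithmetic mean)
    # By the AM-GM inequality, option 1 is always <= option 2.
    ```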

    Addendum

    There is actually a third option, which is to create a grouping variable. I will illustrate with the pigs dataset.

    > pigs.lm <- lm(log(conc) ~ source + factor(percent), data = pigs)
    

    Here are the EMMs for percent:

    > emmeans(pigs.lm, "percent")
     percent   emmean         SE df lower.CL upper.CL
           9 3.445307 0.04088810 23 3.360723 3.529890
          12 3.624861 0.03837600 23 3.545475 3.704248
          15 3.662706 0.04372996 23 3.572244 3.753168
          18 3.745156 0.05296030 23 3.635599 3.854713
    
    Results are averaged over the levels of: source 
    Results are given on the log (not the response) scale. 
    Confidence level used: 0.95 
    

    Now let's create a grouping factor group:

    > pigs.emm = add_grouping(ref_grid(pigs.lm), "group", "percent", c("1&2","1&2","3&4","3&4"))
    > str(pigs.emm)
    'emmGrid' object with variables:
        source = fish, soy, skim
        percent =  9, 12, 15, 18
        group = 1&2, 3&4
    Nesting structure:  percent %in% group
    Transformation: "log" 
    

    Now get the EMMs for group and note they are just the averages of the respective levels:

    > emmeans(pigs.emm, "group")
     group   emmean         SE df lower.CL upper.CL
     1&2   3.535084 0.02803816 23 3.477083 3.593085
     3&4   3.703931 0.03414907 23 3.633288 3.774574
    
    Results are averaged over the levels of: source, percent 
    Results are given on the log (not the response) scale. 
    Confidence level used: 0.95 
    

    And here is a summary on the response scale:

    > summary(.Last.value, type = "response")
     group response       SE df lower.CL upper.CL
     1&2   34.29790 0.961650 23 32.36517 36.34605
     3&4   40.60662 1.386678 23 37.83703 43.57893
    
    Results are averaged over the levels of: source, percent 
    Confidence level used: 0.95 
    Intervals are back-transformed from the log scale
    

    These are averages rather than sums, but otherwise this works as intended, and the transformation doesn't get zapped as it does in contrast().