Search code examples
roffsetglmemmeansmass

Negative Binomial model offset seems to be creating a 2 level factor


I am trying to fit some data to a negative binomial model and run a pairwise comparison using emmeans. The data has two different sample sizes, 15 and 20 (num_sample in the example below).
I have set up two data frames: good.data which produces the expected result of offset() using random sample sizes between 15 and 20, and bad.data using a sample size of either 15 or 20, which seems to produce a factor of either 15 or 20. The bad.data pairwise comparison produces way too many comparisons compared to the good.data, even though they should produce the same number?

set.seed(1)
library(dplyr)
library(emmeans)
library(MASS)
# make data that works
data.frame(site=c(rep("A",24),
                  rep("B",24),
                  rep("C",24),
                  rep("D",24),
                  rep("E",24)),
           trt_time=rep(rep(c(10,20,30),8),5),
           pre_trt=rep(rep(c(rep("N",3),rep("Y",3)),4),5),
           storage_time=rep(c(rep(0,6),rep(30,6),rep(60,6),rep(90,6)),5),
           num_sample=sample(c(15,17,20),24*5,T),# more than 2 sample sizes...
           bad=sample(c(1:7),24*5,T,c(0.6,0.1,0.1,0.05,0.05,0.05,0.05)))->good.data
# make data that doesn't work
data.frame(site=c(rep("A",24),
                  rep("B",24),
                  rep("C",24),
                  rep("D",24),
                  rep("E",24)),
           trt_time=rep(rep(c(10,20,30),8),5),
           pre_trt=rep(rep(c(rep("N",3),rep("Y",3)),4),5),
           storage_time=rep(c(rep(0,6),rep(30,6),rep(60,6),rep(90,6)),5),
           num_sample=sample(c(15,20),24*5,T),# only 2 sample sizes...
           bad=sample(c(1:7),24*5,T,c(0.6,0.1,0.1,0.05,0.05,0.05,0.05)))->bad.data
# fit models
good.data%>%
  mutate(trt_time=factor(trt_time),
         pre_trt=factor(pre_trt),
         storage_time=factor(storage_time))%>%
  MASS::glm.nb(bad~trt_time:pre_trt:storage_time+offset(log(num_sample)),
               data=.)->mod.good
bad.data%>%
  mutate(trt_time=factor(trt_time),
         pre_trt=factor(pre_trt),
         storage_time=factor(storage_time))%>%
  MASS::glm.nb(bad~trt_time:pre_trt:storage_time+offset(log(num_sample)),
               data=.)->mod.bad
  
# pairwise comparison
emmeans::emmeans(mod.good,pairwise~trt_time:pre_trt:storage_time+offset(log(num_sample)))$contrasts%>%as.data.frame()
emmeans::emmeans(mod.bad,pairwise~trt_time:pre_trt:storage_time+offset(log(num_sample)))$contrasts%>%as.data.frame()

Solution

  • First , I think you should look up how to use emmeans.The intent is not to give a duplicate of the model formula, but rather to specify which factors you want the marginal means of.

    However, that is not the issue here. What emmeans does first is to setup a reference grid that consists of all combinations of

    • the levels of each factor
    • the average of each numeric predictor; except if a numeric predictor has just two different values, then both its values are included.

    It is that exception you have run against. Since num_samples has just 2 values of 15 and 20, both levels are kept separate rather than averaged. If you want them averaged, add cov.keep = 1 to the emmeans call. It has nothing to do with offsets you specify in emmeans-related functions; it has to do with the fact that num_samples is a predictor in your model.

    The reason for the exception is that a lot of people specify models with indicator variables (e.g., female having values of 1 if true and 0 if false) in place of factors. We generally want those treated like factors rather than numeric predictors.