I am unsure about what to do in the following situation, where some levels of the fixed effect are missing (within a random effect) - they are unbalanced.
Imagine an aquarium with 5,000 individual fish. They are part of 100 different species. I want to test if there is a relationship between their weight (continuous) and whether they are fed by Alan or Susie (there only are two employees that feed fish). Species is the random effect.
My model looks like this: weight ~ employee + (1 + employee | species)
: mixed model (lmer
) with random intercept and slope.
But for some species, all fish are fed by the same employee (Alan or Susie). Should I leave these observations in the model, or should I exclude them? Is there some literature on this?
This should be fine. Mixed models are well suited to this kind of missingness, unless it's really extreme (e.g. there were no species, or very few, that were measured by both employees). A small made-up example is below.
The cases where employee 1's measurements were missing have slightly wider confidence intervals; the cases where employee 2's measurements are missing have considerably wider CIs on the employee-2 effect (not sure why these aren't exactly zero, but my guess is that it has to do with the particular random effects values simulated - i.e. the random effects have zero mean overall, so these may be slightly >0 to make the overall estimates balance ... ?)
n_emp <- 2
n_spp <- 10
n_rep <- 20
dd <- expand.grid(emp = factor(seq(n_emp)),
spp = factor(seq(n_spp)),
rep = seq(n_rep))
dd2 <- subset(dd,
!((emp=="1" & (spp %in% 1:2)) |
(emp=="2" & (spp %in% 3:4))))
library(lme4)
form <- weight ~ emp + (1 + emp | spp)
## BUG/edge case: form[-2] breaks?
dd2$weight <- simulate(~ emp + (1 + emp | spp),
seed = 101,
newdata = dd2,
newparams = list(beta = c(1,1),
theta = c(1,1,1),
sigma = 1),
family = gaussian)[[1]]
m <- lmer(form, data = dd2)
rr <- as.data.frame(ranef(m))
rr$miss <- with(rr,
ifelse(grp %in% 1:2, "miss1",
ifelse(grp %in% 3:4, "miss2",
"nomiss")))
library(ggplot2)
ggplot(rr, aes(y = grp, x = condval, xmin = condval-2*condsd,
xmax = condval + 2*condsd, colour = miss)) +
geom_pointrange() +
facet_wrap(~term)