Search code examples
rmissing-datalme4multi-level

Discrepancy between lme4 sample size and complete case dataset?


I'm currently estimating a hierarchical linear model (HLM) using lme4. My entire dataset has 367 observations. lme4 estimated my model using 341 observations - I assumed some were dropped due to missing data. However, when I sum the complete cases on the model's variables, I end up with 337 observations. This is making it difficult to test for assumptions when the model is a different length than the dataset.

There is a discrepancy between complete cases and the observations 'used' by lme4.

  1. Why would lme4 use 4 non-complete cases?
  2. How would I find out what exact observations (as ID #s) are being used by lme4?

As described, I tried to remove missing data from my main dataset assuming lme4 drops cases listwise. I've tried checking each variable for its missingness to see if lme4 was just testing a certain variable, but none match up with lme4's output estimate of 341.

If needed, I can provide the anonymized dataset - but hoping there's something easy I'm not aware of!


Solution

  • The most obvious reason would be that lme4 (and internally model.frame) assesses completeness only on the basis of the variables that are actually used in the model. Do you have NA values in variables that are not included in the model formula?

    (For what it's worth, this default also means that it's a good idea to filter the full data set for complete cases first if you are going to fit a series of models with different subsets of predictors and want the models to be comparable ...)

    Example:

    library(lme4)
    ss <- sleepstudy
    ss$Days[c(1, 4, 10)] <- NA
    ss$extra <- 1
    ss$extra[c(4, 20, 25)] <- NA
    
    nrow(ss)  ## 180
    sum(complete.cases(ss))  ## 175; drops obs with missing `extra`
    m <- lmer(Reaction ~ Days + (1|Subject), ss, na.action = na.omit)
    mf <- model.frame(m)
    nrow(mf)  ## 177; ignores variables not in the model
    

    Which rows were deleted? Two ways to check:

    attr(mf, "na.action")
     1  4 10 
     1  4 10 
    attr(,"class")
    [1] "omit"
    
    setdiff(rownames(ss), rownames(mf))
    [1] "1"  "4"  "10"
    

    The equivalent to the internal code is

    nrow(model.frame(Reaction ~ Days + Subject, ss))  ## 177
    

    (the formula gets processed so that the random effects grouping variables are also included in the formula for the purpose of excluding incomplete cases)