Search code examples
rdata.tabledata-analysislme4

issue with as.formula() or formula() in lmer within "by" of data.table


Specifically, when I pass a model (as a string) in lmer via as.formula() or formula() within the data.table code using the "by", it seems the entire data is being analyzed instead of stratified analysis by the "sim" variable. Sample Code below explains the issue I am facing.

Step1. I created a mock data as follows. The data below is not really longitudinal, but should suffice for the purposes of a demo of the issue I am facing.

library(data.table)
library(lme4)

# ----  Step1. multiple simulated data ----
Nsim=2
n=20
csize=4

set.seed(10)
y <- round(rnorm(Nsim*n*csize, 0, 3),2)
time <- rep(0:(csize-1),Nsim*n) 
w <- rbinom(Nsim*n,size=1,prob=0.6); w<- rep(w, each=csize)
x<- round(2*runif(Nsim*n*csize)+1,2)

id = rep(rep(1:n,each=csize),Nsim)
sim=rep(1:Nsim,each=n*csize)

dat<- data.table(y, time, w, x, id, sim)
setkey(dat,sim)
# ----  End of (multiple simulated data) ----

Output from Step 1

Step 2. Forcefully running lmer model for each sim to see what the true results should look like.

cat('# -----------  Step 2. output we should expect  -----------\n') 
BIC(lmer(y~time+w+x+(1+time|id), data=dat[sim==1,]))
BIC(lmer(y~time+w+x+(1+time|id), data=dat[sim==2,]))

Output from Step 2

Step 3. Getting the above results using stratified analysis through by of data.table. Results from Step 2 and Step 3 should be the same and they are.

cat('# ------  Step 3. Stratified results using "by" of data.table ------\n')
dat[,BIC(lmer(y~time+w+x+(1+time|id))), by=sim]

Output from Step 3

Step 4. Running the stratified analysis by passing string-model in lmer through as.formula() or formula(). This code analyzes the whole data instead of stratified analysis by sim (see the results from Step 5)

cat('# -----  Step 4. trying "as.formula()" or "formula()" in data.table  -----\n')
dat[,{
   use.form<- as.formula('y~time+w+x+(1+time|id)')
   BIC(lmer(use.form))
   }, by=sim]

Output from Step 4

Step 5. I confirmed results from Step 4 are indeed incorrect since it ignores stratification by sim, as shown below.

cat('# ---------- Step 5.  output NOT expected ------------------\n')
BIC(lmer(y~time+w+x+(1+time|id), data=dat)),'\n')

Output from Step 5

My code in Step 4 is definitely missing an important aspect of data.table and its by process. Any guidance from the community will be much appreciated.

Also, this is my first ever post here. Apologies if I missed something in terms of creating a good post.


Solution

  • @RichieSacramento points out that this is fixed by passing the data explicitly:

    Passing .SD to the data argument of the model fixes the issue. dat[, BIC(lmer('y~time+w+x+(1+time|id)', .SD)), by=sim].

    I can't say exactly what's going on without spending a lot more time digging, but evaluation of formulas in lme4 is rather delicate when the data argument isn't specified — lmer tries to look in the environment of the formula, but scoping can get complicated/confusing. It's always safest to specify data explicitly to avoid these headaches.