Specifically, when I pass a model (as a string) in lmer via as.formula() or formula() within the data.table code using the "by", it seems the entire data is being analyzed instead of stratified analysis by the "sim" variable. Sample Code below explains the issue I am facing.
Step1. I created a mock data as follows. The data below is not really longitudinal, but should suffice for the purposes of a demo of the issue I am facing.
library(data.table)
library(lme4)
# ---- Step1. multiple simulated data ----
Nsim=2
n=20
csize=4
set.seed(10)
y <- round(rnorm(Nsim*n*csize, 0, 3),2)
time <- rep(0:(csize-1),Nsim*n)
w <- rbinom(Nsim*n,size=1,prob=0.6); w<- rep(w, each=csize)
x<- round(2*runif(Nsim*n*csize)+1,2)
id = rep(rep(1:n,each=csize),Nsim)
sim=rep(1:Nsim,each=n*csize)
dat<- data.table(y, time, w, x, id, sim)
setkey(dat,sim)
# ---- End of (multiple simulated data) ----
Step 2. Forcefully running lmer model for each sim to see what the true results should look like.
cat('# ----------- Step 2. output we should expect -----------\n')
BIC(lmer(y~time+w+x+(1+time|id), data=dat[sim==1,]))
BIC(lmer(y~time+w+x+(1+time|id), data=dat[sim==2,]))
Step 3. Getting the above results using stratified analysis through by of data.table. Results from Step 2 and Step 3 should be the same and they are.
cat('# ------ Step 3. Stratified results using "by" of data.table ------\n')
dat[,BIC(lmer(y~time+w+x+(1+time|id))), by=sim]
Step 4. Running the stratified analysis by passing string-model in lmer through as.formula() or formula(). This code analyzes the whole data instead of stratified analysis by sim (see the results from Step 5)
cat('# ----- Step 4. trying "as.formula()" or "formula()" in data.table -----\n')
dat[,{
use.form<- as.formula('y~time+w+x+(1+time|id)')
BIC(lmer(use.form))
}, by=sim]
Step 5. I confirmed results from Step 4 are indeed incorrect since it ignores stratification by sim, as shown below.
cat('# ---------- Step 5. output NOT expected ------------------\n')
BIC(lmer(y~time+w+x+(1+time|id), data=dat)),'\n')
My code in Step 4 is definitely missing an important aspect of data.table and its by process. Any guidance from the community will be much appreciated.
Also, this is my first ever post here. Apologies if I missed something in terms of creating a good post.
@RichieSacramento points out that this is fixed by passing the data explicitly:
Passing
.SD
to thedata
argument of the model fixes the issue.dat[, BIC(lmer('y~time+w+x+(1+time|id)', .SD)), by=sim]
.
I can't say exactly what's going on without spending a lot more time digging, but evaluation of formulas in lme4
is rather delicate when the data
argument isn't specified — lmer
tries to look in the environment of the formula, but scoping can get complicated/confusing. It's always safest to specify data
explicitly to avoid these headaches.