Tags: r, mixed-models, multi-level, nlme, r2jags

Specifying a non-linear relation between random effects in R


I'm working with multilevel models to describe different patterns of longitudinal change. Dingemanse et al. (2010) describe a 'fanning out' pattern that occurs when the random effects are perfectly correlated. However, I found that a similar pattern occurs when the relationship between the random effects is non-linear but monotonically increasing over the observed interval. In that case the random effects are not perfectly correlated but are instead linked by a deterministic function; see the example below for an illustration. The example still has a high intercept-slope correlation (> .9), but it is possible to get correlations below .7 while still maintaining a perfect (functional) intercept-slope relation.
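To make the correlation claim concrete, here is a minimal sketch that computes the linear correlation between random intercepts and slopes when the slopes are a perfect exponential function of the intercepts (the same link used in the simulation below):

# Correlation induced by a perfect exponential intercept-slope link
int <- rnorm(1e5, 100, 15)  # random intercepts
slp <- exp(int/25)/75       # slopes: deterministic function of the intercepts
cor(int, slp)               # roughly .91: high, but not 1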

My question: is there a way to include a perfect (non-linear) relationship between random effects in a multilevel model using nlme or some other R package? MLwiN can constrain the slope-intercept covariance, which would be a start, but that is not sufficient to encode non-linear relations. I've been unable to find a solution for nlme so far, but perhaps you know of another package that can include this in the model.
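For concreteness, the kind of constraint I'd like to impose would look something like the hypothetical nlme::nlme call below, where a single random effect u0 appears in the intercept and, through the exponential link, in both slopes. This is only an untested sketch (variable names value2, variable, and ID refer to the data frame created in the simulation below; starting values are rough guesses):

library(nlme)
# Sketch: one random effect u0 drives the intercept and both slopes
mod_nl <- nlme(value2 ~ (b0 + u0) +
                 (b1 + exp((b0 + u0)/25)/75) * variable +
                 (b2 + exp((b0 + u0)/25)/250) * variable^2,
               data = df2,
               fixed = b0 + b1 + b2 ~ 1,
               random = u0 ~ 1 | ID,
               start = c(b0 = 100, b1 = 0, b2 = 0))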

Apologies for the sloppy coding. I hope my question is sufficiently clear, but let me know if anything needs clarification. Any help or alternative solution is greatly appreciated.

library(ggplot2); library(reshape2); library(nlme)

set.seed(123456)

# Change function, quadratic growth:
# Y_it = B0_i + B1_i*time + B2_i*time^2
chn <- function(int, slp, slp2, time){
  score <- int + slp * time + slp2 * time^2
  return(score)
}


# Set N, random intercepts, time and ID
N <- 100
start <- rnorm(N, 100, 15)  # Random intercepts
time <- matrix(1:15, ncol = 15, nrow = N, byrow = TRUE)  # Time, balanced panel data
ID <- 1:N  # ID variable

# Random intercept; linear slope: exp(intercept/25)/75; quadratic slope: exp(intercept/25)/250
score3 <- matrix(NA, ncol = ncol(time), nrow = N)
for(x in ID){
  score3[x, ] <- chn(start[x], exp(start[x]/25)/75, exp(start[x]/25)/250, time[x, ])
}

# Create dataframe and reshape to long format
df <- data.frame(ID, score3)
df2 <- melt(df, id = 'ID')
df2$variable <- as.vector(time)  # Replace column labels with numeric time


# Plot individual trajectories with the average quadratic trend
ggplot(df2, aes(x = variable, y = value)) + geom_line(aes(group = ID)) +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE, size = 2, col = "red")


# Add noise and estimate model
df2$value2 <- df2$value + rnorm(N * ncol(time), 0, 2)

# Random intercept model
mod1 <- lme(value2 ~ variable + I(variable^2),
            random = list(ID = ~ 1),
            data = df2, method = "ML", na.action = na.exclude)
summary(mod1)

# Random slopes model
mod2 <- update(mod1, . ~ ., random = list(ID = ~ variable + I(variable^2)))
summary(mod2)


pairs(ranef(mod2))
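The pairs plot shows the fanning dependence; the numeric correlations between the estimated random effects (using only objects created above) confirm that they are high but not 1:

# Correlations between estimated random effects: high (> .9) but below 1,
# even though the generating relation is a perfect function
cor(ranef(mod2))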

Solution

  • Based on the suggestion by @MattTyers I had a go at a Bayesian approach using rjags. It's a first attempt on simulated data with a known relation between the random effects, but it seems to yield accurate estimates (better than the nlme model). I am still a bit worried about the Gelman convergence diagnostic and about how to apply this solution to actual data. However, I figured I'd post my answer in case someone is working on the same problem.

    # BAYESIAN ESTIMATE
    library(ggplot2); library(reshape2)
    # Set new dataset
    set.seed(12345)
    
    # New dataset to separate random and fixed
    N<-100              # Number of respondents
    int<-100            # Fixed effect intercept
    U0<-rnorm(N,0,15)   # Random effect intercept
    slp_lin<-1          # Fixed effect linear slope
    slp_qua<-.25        # Fixed effect quadratic slope
    ID <- 1:N           # ID numbers
    U1<-exp(U0/25)/7.5  # Random effect linear slope
    U2<-exp(U0/25)/25   # Random effect quadratic slope
    times<-15           # Max age
    err <- matrix(rnorm(N*times,0,2),ncol = times, nrow = N) # Residual term
    age <- 1:15         # Ages
    
    # Create matrix of 'math' scores using model
    math<-matrix(NA,ncol = times, nrow = N)
    
    for(i in ID){
      for(j in age){
        math[i,j] <- (int + U0[i]) +
          (slp_lin + U1[i]) * age[j] +
          (slp_qua + U2[i]) * (age[j]^2) +
          err[i,j]
      }
    }
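    # Aside: the same matrix can be built without loops; a sketch of an
    # equivalent vectorized construction using outer():
    # math <- int + U0 + outer(slp_lin + U1, age) + outer(slp_qua + U2, age^2) + err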
    
    # Melt dataframe and plot scores
    e.long<-melt(math)
    names(e.long) <- c("ID","age","math")
    ggplot(e.long,aes(x= age, y= math)) + geom_line(aes(group = ID))
    
    # Create dataframe for rjags
    dat<-list(math=as.numeric(e.long$math),
              age=as.numeric(e.long$age), 
              childnum=e.long$ID,
              n=length(e.long$math), 
              nkids=length(unique(e.long$ID)))
    lapply(dat , summary)
    
    
    library(rjags)
    
    # Model with uninformative priors
    model_rnk<-"
    model{
    
    #Model, fixed effect age and random intercept-slope connected
    for( i in 1:n)
    {
      math[i]~dnorm(mu[i], sigm.inv)
      mu[i]<-(b[1] + u[childnum[i],1]) + (b[2]+ u[childnum[i],2]) * age[i] + 
      (b[3]+ u[childnum[i],3]) * (age[i]^2) 
    }
    
    #Random slopes
    for (j in 1:nkids)
    {
      u[j, 1] ~ dnorm(0, tau.a)
      u[j, 2] <- exp(u[j,1]/25)/7.5
      u[j, 3] <- exp(u[j,1]/25)/25
    }
    
    # Priors on the fixed intercept and slopes
    b[1] ~ dnorm(0.0, 1.0E-5)
    b[2] ~ dnorm(0.0, 1.0E-5)
    b[3] ~ dnorm(0.0, 1.0E-5)
    
    # Residual variance priors
    sigm.inv ~ dgamma(1.5, 0.001)  # precision of math[i]
    sigm <- pow(sigm.inv, -1/2)    # standard deviation
    
    
    # Varying intercepts, varying slopes priors
    tau.a ~ dgamma(1.5, 0.001)
    sigma.a<-pow(tau.a, -1/2)
    }"
    
    #Initialize the model
    mod_rnk<-jags.model(file=textConnection(model_rnk), data=dat, n.chains=2 )
    
    #burn in 
    update(mod_rnk, 5000)
    
    #collect samples of the parameters
    samps_rnk <- coda.samples(mod_rnk, variable.names = c("b", "sigma.a", "sigm"),
                              n.iter = 5000, thin = 1)
    
    #Numerical summary of each parameter:
    summary(samps_rnk)
    
    gelman.diag(samps_rnk,  multivariate = F)
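    # Additional convergence checks from coda: effective sample sizes and
    # per-chain traceplots; running more or longer chains may also help
    # settle the Gelman-Rubin diagnostic
    effectiveSize(samps_rnk)
    plot(samps_rnk)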
    
    # nlme model for comparison
    library(nlme)
    Stab_rnk2 <- lme(math ~ age + I(age^2),
                     random = list(ID = ~ age + I(age^2)),
                     data = e.long, method = "ML", na.action = na.exclude)
    summary(Stab_rnk2)
    

    The outcome looks very close to the population values (intercept 100, linear slope 1, quadratic slope 0.25, residual SD 2, random intercept SD 15):

    1. Empirical mean and standard deviation for each variable,
       plus standard error of the mean:
    
                    Mean       SD  Naive SE Time-series SE
        b[1]    100.7409 0.575414 5.754e-03      0.1065523
        b[2]      0.9843 0.052248 5.225e-04      0.0064052
        b[3]      0.2512 0.003144 3.144e-05      0.0003500
        sigm      1.9963 0.037548 3.755e-04      0.0004056
        sigma.a  16.9322 1.183173 1.183e-02      0.0121340
    

    And the nlme estimates are far worse (with the exception of the random intercept):

    Random effects:
     Formula: ~age + I(age^2) | ID
     Structure: General positive-definite, Log-Cholesky parametrization
                StdDev      Corr        
    (Intercept) 16.73626521 (Intr) age  
    age          0.13152688 0.890       
    I(age^2)     0.03752701 0.924  0.918
    Residual     1.99346015             
    
    Fixed effects: math ~ age + I(age^2) 
                    Value Std.Error   DF  t-value p-value
    (Intercept) 103.85896 1.6847051 1398 61.64816       0
    age           1.15741 0.0527874 1398 21.92586       0
    I(age^2)      0.30809 0.0048747 1398 63.20204       0
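    As a further check (a sketch, not run above), the person-level effects u can be monitored as well; plotting the posterior means of u[,2] against u[,1] should trace the exponential link exactly, since the model encodes it deterministically:

    # Monitor the random effects and verify the recovered link
    samps_u <- coda.samples(mod_rnk, variable.names = "u", n.iter = 1000)
    u_hat <- matrix(summary(samps_u)$statistics[, "Mean"], ncol = 3)
    plot(u_hat[, 1], u_hat[, 2])  # traces exp(u1/25)/7.5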