Search code examples
rmixed-modelsnlmelongitudinalmulticollinearity

Collinearity between intercept and slopes that vary by time coding - linear mixed effects model


I'm currently trying to run a linear mixed effects model to estimate how stress changes as a function of time (over 6 timepoints). I've noticed that when the intercepts and slopes of stress trajectories are extracted for each individual in my sample, there is a high correlation between these factors. However, this seems to change depending on how time is coded ... any explanation for why this might be happening? How would you usually deal with collinearity between intercepts and slopes? Any suggestions would be greatly appreciated.

Reproducible example below:

#toy data code: https://rpubs.com/mlmcternan/BC-lme
options(width = 100)

library(lme4)
library(nlme)
library(dplyr)
library(kableExtra)
library(psych)
library(car)
library(ggplot2)
library(ggpubr)
library(tidyverse)

set.seed(222)

# simulate data
N <- 150
nobs <- 6
sigma <- 20
sig00 <- 10
b0 <- 80
b1 <- -6

ID = rep(1:N, each = nobs)

# time is coded 0 to 5 here

Time = rep(0:5, 150)
dat <- data.frame(ID, Time)
u0 <- rnorm(N, 0, sig00)
dat$u0 <- rep(u0, each=nobs)
dat$err <- rnorm(900, 0, sigma)
dat$Stress <- (b0 + u0)  + b1*dat$Time + dat$err
dat <- dat[,-c(3:4)]

# fit the model
fit1 <- lme(Stress ~ Time, random=~Time|ID, data=dat, correlation=corCAR1(form=~Time|ID), method="ML",na.action=na.exclude, control=list(opt="optim"))

# extract predictions
predictions <- data.frame(coef(fit1))
names(predictions) <- c("Intercept", "Slope")

# correlate intercept and slope
cor.test(predictions$Intercept, predictions$Slope, method = "pearson")

# Correlation between Intercept and Slope = 0.8057988 

# plot correlation
plot(predictions$Intercept, predictions$Slope, main="Scatterplot Example", 
     xlab="Intercept ", ylab="Slope ", pch=19)

# Recode time - 0 represents predicted mean at timepoint 2
dat$Time <-  rep(-1:4, 150)

fit1 <- lme(Stress ~ Time, random=~Time|ID, data=dat, correlation=corCAR1(form=~Time|ID), method="ML",na.action=na.exclude, control=list(opt="optim"))

predictions <- data.frame(coef(fit1))
names(predictions) <- c("Intercept", "Slope")
cor.test(predictions$Intercept, predictions$Slope, method = "pearson")
#correlation between intercept and slope = 0.6314682 

plot(predictions$Intercept, predictions$Slope, main="Scatterplot Example", 
     xlab="Intercept ", ylab="Slope ", pch=19)

Solution

  • This turns out to be more of a statistics/math question rather than one about coding. Anyway, I think the reason for your varying correlation is ingrained in the way you define/simulate the data.

    Have a look at (both versions of) your simulated data using a spaghetti plot:

    library(ggplot2)
    
    ggplot(dat, aes(y = Stress, x = Time, colour=as.factor(ID) )) +
      geom_smooth(method=lm, se=FALSE) + theme(legend.position = "none") +
      geom_vline(xintercept=0)
    

    You will see that the regression lines for each ID are closer together in the middle of the plot (around Time == 2.5) than at each end. This is because b1*Time is bigger for larger values of abs(Time).

    Now, when you change the way you code Time from 0 - 5 to -1 - 4, you move the y-axis to the right where the lines are closer together. Because the intercepts are defined as the value of Stress where regression lines cross the y-axis, moving the y-axis brings the intercepts closer together. Or, in other words, moving the y-axis changes the impact of the slope on the intercept.

    This is why it is often recommended to center predictors when you have moderators and/or random slopes. It makes the interpretation of intercepts easier.