I have a dataset where the outcome is a continuous variable measuring the Wellbeing Index. The key variables are:
t: Time in days.
D: Binary intervention variable (0 = No Swimming, 1 = Yes Swimming).
sex: A two-level categorical variable (male, female).
race: A four-level categorical variable.
URL <- "https://github.com/DS4PS/pe4ps-textbook/blob/master/data/time-series-example.rds?raw=true"
dataTS <- readRDS(gzcon(url( URL )))
dataTS$Y <- round(dataTS$Y,1)
dataTS$sex <- sample(c("male", "female"), size = NROW(dataTS), replace = TRUE)
dataTS$race <- sample(c("Race1", "Race2", "Race3", "Race4"), size = NROW(dataTS), replace = TRUE)
The goal is to evaluate the effect of introducing Swimming classes (D) on the Wellbeing Index using an Interrupted Time Series (ITS) model.
plot( dataTS$T, dataTS$Y,
bty="n", pch=19, col="gray",
ylim = c(0, 300), xlim=c(0,400),
xlab = "Time (days)",
ylab = "Wellbeing Index" )
# Line marking the interruption
abline( v=200, col="firebrick", lty=2 )
text( 200, 300, "Start of Swimming Classes", col="firebrick", cex=1.3, pos=4 )`
Specifically, these are the parameters of interest:
Level change (β2): The immediate change in the Wellbeing Index after the intervention. (change level)
Slope change (β3): The change in the trend of the Wellbeing Index over time following the intervention.
So I fit a simple regression model like this
ts <- lm(Y ~ T + D + T*D + sex + race, data = dataTS )
After fitting the model, I want to use emmeans
to visualize the interaction between t
and D
.
Expecting a plot like this:
How can I achieve this?
I don't understand why you want to use emmeans
. It doesn't seem necessary for your goal.
You can't average over categorical variables. Thus, I propose faceting the plot. Your model seems also not specified well. It is missing interactions that seem important if your hypothesis is that sex and race have an impact.
It seems like a good idea to center T to the intervention time.
ts <- lm(Y ~ I(T - 200)*D + sex + race, data = dataTS )
summary(ts)
anova(ts)
pred1 <- expand.grid(T = seq(from = min(dataTS$T), to = max(dataTS$T), by = 1),
sex = unique(dataTS$sex),
race = unique(dataTS$race))
pred1$D <- as.integer(pred1$T >= 200)
pred1$Y <- predict(ts, newdata = pred1)
pred2 <- pred1[pred1$T > 200,]
pred2$D <- 0
pred2$Y <- predict(ts, newdata = pred2)
library(ggplot2)
ggplot(dataTS, aes(x = T, y = Y)) +
geom_point() +
geom_line(data = pred1, aes(group = factor(D), color = "predicted")) +
geom_line(data = pred2, aes(color = "counterfactual")) +
geom_vline(linetype = 2, aes(xintercept = 200, color = "intervention")) +
facet_grid(sex ~ race)
PS: Being from Germany and considering our history, I'm deeply uncomfortable with the US custom of categorising human beings into "races". I urge you to read some papers on the concept of race as applied to humans (e.g., Schaare et al., 2023, Duello et al., 2021), in particular if you apply it in a healthcare or fitness context.