How to specify a hierarchical GAM (HGAM) model with two categorical & a continuous predictor using s(y1, by=y2:y3)) in mgcv?

Starting with dummy data for the question:

zone <- c(rep("Z1",1000),rep("Z2",1000),rep("Z3",1000),rep("Z4",1000))
scenario <- rep(c(rep("S1",250),rep("S2",250),rep("S3",250),rep("S4",250)),4)
model <- rep(rep(c(rep("m1",50),rep("m2",50),rep("m3",50),rep("m4",50),rep("m5",50)),4),4)
time <- rep(seq(2021,2070), 80)
response <- rep(c(rep(rnbinom(50, mu =exp(4), size = 50), 5), 
        rep(rnbinom(50, mu =exp(4.5), size = 50), 5), 
        rep(rnbinom(50, mu =exp(5), size = 50), 5), 
        rep(rnbinom(50, mu =exp(6), size = 50), 5)),4)

df <- cbind(zone, scenario, model, time, response)
df <- as.data.frame(df)
df$zone <- as.factor(df$zone)
df$scenario <- as.factor(df$scenario)
df$model <- as.factor(df$model)
df$time <- as.numeric(df$time)
df$response <- as.numeric(df$response)

I'm trying to fit a fairly simple GAM model in mgcv in R. I'm interested in how a response variable might vary through time (predictions between 2021 - 2070). I expect that the response will be non-linear, so I first fit a smoother over time:

library(mgcv)

m1 <- response ~ s(time)

In this case, there are four models (m1, m2, m3, m4) that each predict how the response may vary into the future under four different "scenarios" of increasing intensity (S1, S2, S3, S4). I'm not explicitly interested in how the models vary (I expect them to), but instead I'm interested in how the scenarios (from all four model predictions) may vary into the future. To account for this structure of the data (i.e. that each scenario contains four model predictions), I've included "model" as a random effect:

m2 <- gam(response~s(time, k=-1) + scenario + s(model, bs='re'), data=df)

Given that the non-linear response between "response" and "time" is likely to vary among scenarios, I've fitted a model with separate smoothers for each level of scenario using the "by" function:

m3 <- gam(response~s(time, by=scenario, k=-1) + scenario + s(model, bs='re'), data=df)

To add to the complexity, the area that the predictions are modeled from is split into four separate zones, so an additive term of "zone" is included in the model:

m4 <- gam(response~s(time, by=scenario, k=-1) + scenario + zone + s(model, bs='re'), data=df)

My question is: I've accounted for the expectation that the "wigglyness" of the response through time may vary across scenarios in m3 by including s(time, by=scenario, k=-1), but... how do I expand that model to account for the potential for the response to vary not only by scenario but also by zone? I envisioned this as something like:

m5 <- gam(response~s(time, by=scenario:zone, k=-1) + scenario + zone + s(model, bs='re'), data=df)

But this isn't the correct syntax. After reading the honestly excellent paper "Hierarchical generalized additive models in ecology: an introduction with mgcv" in PeerJ several times now, but I'm struggling to understand the leap between the nonlinear functional relationship between a simple x~s(y) gam model and how the shape of the function varies between not one, but two different grouping levels.

As far as I can tell I'd need "A single common smoother plus group-level smoothers with differing wiggliness (Model GI)" approach here to allow each each group-specific smoother (here "scenario" and "zone") to have its own smoothing parameter and hence its own level of wiggliness, but i'm struggling to see how this model would fit into the HGAM approach. How can I adapt M5 to fit within the HGAM framework?

Solution

The factor by variable really just codes for the rows in the input data that should be kept together and get their own smooth.

So your intuition is right, that you need the interaction of scenario and zone, but in practice you can't do it via a formula operation inside s() or te() etc.

What you need to do is create the interaction first in your data and then pass that to the by argument:

df <- transform(df, zone_in_scenario = interaction(zone, scenario, drop = TRUE))

The you can use by = zone_in_scenario in your smooth.

You also need to adjust the factor fixed effects to be zone * scenario as you need group means for each combination of these two factors, and this will get you a group mean for each of the smooths that will now be fitted.