Search code examples
rmatrixgammgcvcontrast

GAM distributed lag model with factor smooth interaction (by variable)


I'm trying to compare the climate response in the last 60 years of two subgroups of a plant (factor variable subgroups with 2 levels). The response of the two subgroups which both grew on the same plots is measured in deviation from the long-term growth (plant_growth). As climate data mean temperature (tmean) and mean precipitation (prec) are available. I formulated a distributed lag model using mgcv's gam() to test the hypothesis, that the climate response differs between the plant subgroups:

climate_model <- gam(plant_growth ~ te(tmean, lag, by = subgroups) +
                                    te(prec, lag, , by = subgroups) +
                                    te(tmean, prec, lag, , by = subgroups) ,
                                  data = plant_data)

plant_data is a list that contains tmean, prec and lag as separate numeric matrices, subgroups as factor variable which distinguishes between subgroup A and B, a character variable giving the ID of the plant, and the numeric measured plant_growth as vector.

The problem is, however, that factor by variables cannot be used with the matrix arguments from plant_data. The error message looks as follows:

Error in smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons, scale.penalty = scale.penalty,  : 
  factor `by' variables can not be used with matrix arguments.

I'm wondering if there is a way to include the factor variable subgroups into the distributed lag model so that a comparison between the two levels of the factor is possible.

I've already tried running two separate lag models for the two levels of subgroups. This works fine. However, I cannot really compare the predictions of the two models because the fit and the parameters of the smooths are different. Moreover, in this way the the climate response of the two subgroups is treated as if it was completely independent. This is however not the case.

I was reproduce my problem with growth data from the Treeclim package:

library("treeclim") #Data library
data("muc_spruce") #Plant growth
data("muc_clim") #Climate data

#Format climate to wide
clim <- pivot_wider(muc_clim, names_from = month, values_from = c(temp,prec))

#Format the growth data and add three new groth time series
growth <- muc_spruce %>%
            select(-samp.depth) %>%
            mutate(year = as.numeric(row.names(muc_spruce))) %>%
            mutate(ID = 1) %>%
            rename("plant_growth" = "mucstd")

additional_growth <- data.frame()
for (i in c(1:3)){
  
A <- growth %>%
  mutate(plant_growth = plant_growth + runif(nrow(muc_spruce), min = 0, max = 0.5)) %>%
  mutate(ID = ID + i)

additional_growth <- rbind(additional_growth, A)
}

growth <- rbind(growth, additional_growth)

#Bring growth and climate data together
plant_data <- na.omit(left_join(growth, clim))
rm(A, growth, clim, muc_clim, muc_spruce, additional_growth, i) #clean

#Add the subgroups label
plant_data$subgroups <- as.factor(c(rep("A", nrow(plant_data)/2), rep("B", nrow(plant_data)/2)))

#Format for gam input
plant_data <- list(lag = matrix(1:12,nrow(plant_data),12,byrow=TRUE),
                       year = plant_data$year,
                       ID = plant_data$ID,
                       plant_growth = plant_data$plant_growth,
                       subgroups = as.factor(plant_data$subgroups),
                       tmean = data.matrix(plant_data[,c(4:15)]),
                       prec = data.matrix(plant_data[,c(16:27)]))

Solution

  • These are challenging to implement, but fortunately they are possible. As the response above suggests, if you look at the help for linear functionals (?mgcv::linear.functional.terms and ?mgcv::gam.models), you will see that the by variable is used as a weighting matrix, and the resulting smooth is constructed as rowSums(L*F), where F is your distributed lag smooth and L is the weighting matrix. I spent a lot of time on this for a few recent projects, and finally realised that I needed to supply group-level distributed lag terms for each group in the formula (this is reasonably well-described here: https://github.com/nicholasjclark/portal_VAR). In short, you can create weight matrices for each level of the grouping factor for setting up group-level distributed lag terms:

    weights_l1 <- weights_l2 <- matrix(1, ncol = ncol(plant_data$lag), 
                                       nrow = nrow(plant_data$lag))
    

    For each level in the grouping factor, the rows in its weighting matrix need to have 1s when the corresponding observation in the data matches that series, and 0s otherwise

    weights_l1[!(plant_data$subgroups == levels(plant_data$subgroups)[1], ] <- 0
    plant_data$weights_l1 <- weights_l1
    
    weights_l2[!(plant_data$subgroups == levels(plant_data$subgroups)[2], ] <- 0
    plant_data$weights_l2 <- weights_l2
    
    

    Now you are ready to fit a model with different distributed lag terms per level of the grouping factor

    climate_model <- gam(plant_growth ~ 
                           te(tmean, lag, by = weights_l1) +
                           te(tmean, lag, by = weights_l2) +
                           te(prec, lag, by = weights_l1) +
                           te(prec, lag, by = weights_l2),
                         data = plant_data)