Search code examples
rsmoothinggammgcvbrms

mgcv GAM: more than one variable in `by` argument (smooth varying by more than 1 factor)


I need to model a smooth term over more than one factor. The by argument allows me to model one smooth per factor level, but I cannot find how to do that over multiple factors.

I tried solutions akin to the following, but with no success:

data <- iris
data$factor2 <- rep(c("A", "B"), 75)

mgcv::gam(Sepal.Length ~ s(Petal.Length, by = c(Species, factor2)), data = data)
#> Error in model.frame.default(formula = Sepal.Length ~ 1 + Petal.Length + : variable lengths differ (found for 'c(Species, factor2)')

Created on 2021-08-05 by the reprex package (v2.0.0)

Any help is welcomed!


Solution

  • One of the issue created by interaction() is that it changes the model's matrix, meaning that some variables contained in the model's data are changed:

    m <- mgcv::gam(body_mass_g ~ s(flipper_length_mm, by = interaction(species, sex)), data = palmerpenguins::penguins)
    head(insight::get_data(m))
    #>   body_mass_g flipper_length_mm       species    sex
    #> 1        3750               181   Adelie.male   male
    #> 2        3800               186 Adelie.female female
    #> 3        3250               195 Adelie.female female
    #> 5        3450               193 Adelie.female female
    #> 6        3650               190   Adelie.male   male
    #> 7        3625               181 Adelie.female female
    

    Created on 2021-08-06 by the reprex package (v2.0.1)

    This can lead to some issues when using postprocessing functions, for instance for visualisation.

    However, following Gavin's and IRTFM's answers, this can be easily addressed by adding the variables as fixed effects in the model.

    Here is a demonstration, also illustrating the differences between two separate smooths and the interaction:

    library(ggplot2)
    #> Warning: package 'ggplot2' was built under R version 4.0.5
    
    set.seed(1)
    
    # Create data
    data <- data.frame(x = rep(seq(-10, 10, length.out = 500), 2),
                       fac1 = as.factor(rep(c("A", "B", "C"), length.out = 1000)),
                       fac2 = as.factor(rep(c("X", "Y"), each = 500)))
    data$y <- data$x^2 + rnorm(nrow(data), sd = 5)
    data$y[data$fac1 == "A"] <- sign(data$x[data$fac1 == "A"]) * data$y[data$fac1 == "A"] + 50
    data$y[data$fac1 == "B"] <- datawizard::change_scale(data$y[data$fac1 == "B"]^3, c(-50, 100))
    data$y[data$fac2 == "X" & data$fac1 == "C"] <- data$y[data$fac2 == "X" & data$fac1 == "C"] - 100
    data$y[data$fac2 == "X" & data$fac1 == "B"] <- datawizard::change_scale(data$y[data$fac2 == "X" & data$fac1 == "B"] ^ 2, c(-50, 100))
    data$y[data$fac2 == "X" & data$fac1 == "A"] <- datawizard::change_scale(data$y[data$fac2 == "X" & data$fac1 == "A"] * -3, c(0, 100))
    
    # Real trends
    ggplot(data, aes(x = x, y = y, color = fac1, shape = fac2)) + 
      geom_point()
    

    # Two smooths
    m <- mgcv::gam(y ~ fac1 * fac2 + s(x, by = fac1) + s(x, by = fac2), data = data)
    plot(modelbased::estimate_relation(m, length = 100, preserve_range = F))
    

    # Interaction
    m <- mgcv::gam(y ~ fac1 * fac2 + s(x, by = interaction(fac1, fac2)), data = data)
    plot(modelbased::estimate_relation(m, length = 100, preserve_range = F))
    

    Created on 2021-08-06 by the reprex package (v2.0.1)

    The last model manages to recover the trends for each of the factors' combination.