Search code examples
rhierarchical-datamixed-modelsgammgcv

Troubles predicting fixed effects from a hierarchical GAM in mgcv


I have been fitting different hierarchical GAMs (hereafter: HGAM) using mgcv in R. I can extract and plot their predictions for their random effects without problems. Conversely, extracting and plotting their predictions for their fixed effects only works for some models, and I don't know why.

Here is a practical example, which refers to the color spectra of flowers from two species (Taxon) sampled at various localities (also discussed here):

rm(list=ls()) # wipe R's memory clean
library(pacman) # load packages, installing them from CRAN if needed
p_load(RCurl) # allows accessing data from URL
ss <- read.delim(text=getURL("https://raw.githubusercontent.com/marcoplebani85/datasets/master/flower_color_spectra.txt"))
head(ss)
ss$density <- ifelse(ss$density<0, 0, ss$density) # set spurious negative reflectance values to zero
ss$clr <- ifelse(ss$Taxon=="SpeciesB", "red", "black")
ss <- with(ss, ss[order(Locality, wl), ])

These are the mean color spectra at the population level for the two species (rolling means were used):

enter image description here

Each color refers to a different species. Each line refers to a different locality.

The following model is a HGAM of type G according to Pedersen et al.'s classification (2019) and it does not give any issues:

gam_G1 <- bam(density ~ Taxon # main effect
        + s(wl, by = Taxon, k = 20) # interaction
        + s(Locality, bs="re"), # "re" is short for "random effect"
        data = ss, method = 'REML',
        family="quasipoisson"
        )
# gam.check(gam_G1)
# k.check(gam_G1)   
# MuMIn::AICc(gam_G1)
# gratia::draw(gam_G1)
# plot(gam_G1, pages=1)

# use gam_G1 to predict wl by Locality

# dataset of predictor values to estimate response values for:    
nn <- unique(ss[, c("wl", "Taxon", "Locality", "clr")])
# predict:
pred <- predict(object= gam_G1, newdata=nn, type="response", se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit

# use gam_G1 to predict wl by Taxon
    
# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl", 
                "Taxon", 
                "Locality",
                "clr")])
nn$Locality=0 # turns random effect off
# after https://stats.stackexchange.com/q/131106/214127

# predict:
pred <- predict(object = gam_G1, 
                type="response", 
                newdata=nn, 
                se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit

R warns me that factor levels 0 not in original fit, but it executes the task without issues:

enter image description here

Left panel: gam_G1 predictions at the Locality level. Right panel: gam_G1 predictions for the fixed effects.

Troublesome models

The following model is a HGAM of type "GI" sensu Pedersen et al. (2019). It produces more accurate predictions at the Locality level, but I can only get NA as predictions at the level of fixed effects:

# GI: models with a global smoother for all observations, 
# plus group-level smoothers, the wiggliness of which is estimated individually 
start_time <- Sys.time()
gam_GI1 <- bam(density ~ Taxon # main effect
        + s(wl, by = Taxon, k = 20) # interaction
        + s(wl, by = Locality, bs="tp", m=1)
        # "tp" is short for "thin plate [regression spline]"
        + s(Locality, bs="re"),
        family="quasipoisson",
        data = ss, method = 'REML'
        )
end_time <- Sys.time()
end_time - start_time # it took ~2.2 minutes on my computer
# gam.check(gam_GI1)
# k.check(gam_GI1)
# MuMIn::AICc(gam_GI1)

Attempt at drawing predictions for the fixed effects (Taxon and wl) according to gam_GI1:

# dataset of predictor values to estimate response values for:
nn <- unique(ss[, c("wl", 
                "Taxon", 
                "Locality",
                "clr")])
nn$Locality=0 # turns random effect off
# after https://stats.stackexchange.com/q/131106/214127

# predict:
pred <- predict(object = gam_GI1, 
                type="response", 
                # exclude="c(Locality)", 
                # # this should turn random effect off
                # # (doesn't work for me)
                newdata=nn, 
                se.fit=T)
nn$fit <- pred$fit
nn$se <- pred$se.fit
head(nn)
#       wl    Taxon Locality clr fit se
# 1 298.34 SpeciesB        0 red  NA NA
# 2 305.82 SpeciesB        0 red  NA NA
# 3 313.27 SpeciesB        0 red  NA NA
# 4 320.72 SpeciesB        0 red  NA NA
# 5 328.15 SpeciesB        0 red  NA NA
# 6 335.57 SpeciesB        0 red  NA NA

enter image description here

Left panel: gam_GI1 predictions at the Locality level. Right panel (blank): gam_GI1 predictions for the fixed effects.

The following model, which includes a global smoother for all observations, plus group-level smoothers, all with the same "wiggliness", doesn't provide fixed-effect predictions either:

gam_GS1 <- bam(density ~ Taxon # main effect
        + s(wl, by = Taxon, k = 20) # interaction
        + s(wl, by = Locality, bs="fs", m=1),
        # "fs" is short for "factor-smoother [interaction]"
        family="quasipoisson",
        data = ss, method = 'REML'
        )

Why don't gam_GI1 and gam_GS1 produce predictions for their fixed effects, and how can I obtain them?


The models can take a few minutes to run. To save time, their output can be downloaded from here as an RData file. My R scripts (which include the code for plotting the figures) are available here.


Solution

  • I think you are conflating several things here; The by trick to turn off random effects only works for bs = "re" smooths. Locality is a factor (otherwise your random effect isn't a random intercept) and setting it to 0 is creating a new level (although it could be creating an NA as 0 isn't among the original levels.

    If what you want to do is turn off anything to do with Locality, you should use exclude; however you have the invocation wrong. The reason why it's not working is because you are creating a character vector with a single element "c(Locality)". This fails for obvious reasons once you realize that c(Locality) doesn't related to anything in your model. What you need to provide here is a vector of smooth names as printed by summary(). For example, to exclude the smooth s(Locality, bs = "re"), {mgcv} knows this as s(Locality), so you would use exclude = "s(Locality)".

    In your case, it is tedious to type out all the "s(wl):LocalityLevelX" labels for each smooth. As you have only two taxa, it would be easier to use the complimentary argument terms, where you list smooth labels that you want to include in the model. So you could do terms = c("s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC") or whatever summary() displays for these smooths.

    You also need to include the Taxon term in terms, which I think needs to be:

    terms = c("TaxonSpeciesB", TaxonSpeciesC", 
              "s(wl):TaxonSpeciesB", "s(wl):TaxonSpeciesC")
    

    If you install and load my {gratia} package, you can use smooths(gam_GI1) to list all the smooth labels as far as {mgcv} knows them.

    The by trick works like this:

    gam(y ~ x + s(z) + s(id, bs = "re", by = dummy)
    

    where dummy is set to a numeric value 1 when fitting and to 0 when you are predicting. As this is a numeric by variable you are multiplying the smooth by dummy and hence why setting it to 0 excludes the term. The reason why your code isn't working is because you really want separate smooths for wl for each Locality; Locality is an actual variable of interest in your data/model, not a dummy variable we create to achieve the aim of excluding a term from the model.

    Hopefully now you can see why exclude and terms are much better solutions than this dummy trick.

    FYI, in bs = "tp", the "tp" doesn't mean tensor product smooth. It mean thin plate regression spline (TPRS). You only get tensor product smooths through te(), t2(), or ti() terms.