Search code examples
rgammgcv

Error when trying to fit Hierarchical GAMs (Model GS or S) using mgcv


I have a large dataset (~100k observations) of presence/absence data that I am trying to fit a Hierarchical GAM with individual effects that have a Shared penalty (e.g. 'S' in Pedersen et al. 2019). The data consists of temp as numeric, region (5 groups) as a factor. Here is a simple version of the model that I am trying to fit.

modS1 <- gam(occurrence ~ s(temp, region), family = binomial,
             data = df, method = "REML")

modS2 <- gam(occurrence ~ s(temp, region, k= c(10,4), family = binomial,
             data = df, method = "REML")

In the first case I received the following error: Which I assumed it because k was set too high for region given there are only 5 different regions in the data set.

Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In mean.default(xx) : argument is not numeric or logical: returning NA
2: In Ops.factor(xx, shift[i]) : ‘-’ not meaningful for factors

In the second case I attempt to lower k for region and receive this error:

Error in if (k < M + 1) { : the condition has length > 1
In addition: Warning messages:
1: In mean.default(xx) : argument is not numeric or logical: returning NA
2: In Ops.factor(xx, shift[i]) : ‘-’ not meaningful for factors

I can fit Models G and GI and I from Pedersen et al. 2019 with no issues. It is models GS and S where I run into issues.

If anyone has any insights I would really appreciate it!


Solution

  • The bs = "fs" argument in the code you're using as a guide is important. If we start at the ?s help page and click on the link to the ?smooth.terms help page, we see:

    Factor smooth interactions

    bs="fs" Smooth factor interactions are often produced using by variables (see gam.models), but a special smoother class (see factor.smooth.interaction) is available for the case in which a smooth is required at each of a large number of factor levels (for example a smooth for each patient in a study), and each smooth should have the same smoothing parameter. The "fs" smoothers are set up to be efficient when used with gamm, and have penalties on each null space component (i.e. they are fully ‘random effects’).

    You need to use a smoothing basis appropriate for factors.

    Notably, if you take your source code and remove the bs = "fs" argument and attempt to run gam(log(uptake) ∼ s(log(conc), Plant_uo, k=5, m=2), data=CO2, method="REML"), it will produce the same error that you got.