I have a large dataset (~100k observations) of presence/absence data that I am trying to fit a Hierarchical GAM with individual effects that have a Shared penalty (e.g. 'S' in Pedersen et al. 2019). The data consists of temp as numeric, region (5 groups) as a factor. Here is a simple version of the model that I am trying to fit.
modS1 <- gam(occurrence ~ s(temp, region), family = binomial,
data = df, method = "REML")
modS2 <- gam(occurrence ~ s(temp, region, k= c(10,4), family = binomial,
data = df, method = "REML")
In the first case I received the following error: Which I assumed it because k was set too high for region given there are only 5 different regions in the data set.
Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In mean.default(xx) : argument is not numeric or logical: returning NA
2: In Ops.factor(xx, shift[i]) : ‘-’ not meaningful for factors
In the second case I attempt to lower k for region and receive this error:
Error in if (k < M + 1) { : the condition has length > 1
In addition: Warning messages:
1: In mean.default(xx) : argument is not numeric or logical: returning NA
2: In Ops.factor(xx, shift[i]) : ‘-’ not meaningful for factors
I can fit Models G and GI and I from Pedersen et al. 2019 with no issues. It is models GS and S where I run into issues.
If anyone has any insights I would really appreciate it!
The bs = "fs"
argument in the code you're using as a guide is important. If we start at the ?s
help page and click on the link to the ?smooth.terms
help page, we see:
Factor smooth interactions
bs="fs"
Smooth factor interactions are often produced using by variables (seegam.models
), but a special smoother class (seefactor.smooth.interaction
) is available for the case in which a smooth is required at each of a large number of factor levels (for example a smooth for each patient in a study), and each smooth should have the same smoothing parameter. The"fs"
smoothers are set up to be efficient when used withgamm
, and have penalties on each null space component (i.e. they are fully ‘random effects’).
You need to use a smoothing basis appropriate for factors.
Notably, if you take your source code and remove the bs = "fs"
argument and attempt to run gam(log(uptake) ∼ s(log(conc), Plant_uo, k=5, m=2), data=CO2, method="REML")
, it will produce the same error that you got.