I am trying to run a gam model on a small data set
library(ggplot2)
library(mgcv)
test <- structure(list(x = c(69, 365, 452, 100, 120, 120, 150, 159, 180
), y = c(17.91, 2.58, 4.82, 10.09, 6.24, 10.33, 2.35, 1.94, 3.91
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-9L))
This data looks like this
ggplot(test, aes(x = x, y = y)) + geom_point()
I'd like to try fitting this data with a spline with gam. I tried the following, which returns an error
mod <- gam(test$y ~ s(test$x))
Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : A term has fewer unique covariate combinations than specified maximum degrees of freedom
Based on this post http://r.789695.n4.nabble.com/Help-with-GAM-mgcv-td3074165.html it seems like maybe I need to mandate fewer knots.
I tried telling the model to only use two knots as follows (figuring I'd figure out the optimum number somehow later):
mod <- gam(test$y ~ s(test$x), k = 2)
Error in data[[txt]] : subscript out of bounds
I'm not sure what this later error means, or why I am getting it.
Just in case it is of any use, here is the traceback for that error
traceback()
7: get.var(object$term[i], knots) 6: ExtractData(object, data, knots) 5: smooth.construct3(object, data, knots) 4: smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons, scale.penalty = scale.penalty, null.space.penalty = select, sparse.cons = sparse.cons, diagonal.penalty = diagonal.penalty, apply.by = apply.by, modCon = modCon) 3: gam.setup(formula = list(pf = test$y ~ 1, pfok = 1, smooth.spec = list( list(term = "test$x", bs.dim = -1, fixed = FALSE, dim = 1L, p.order = NA, by = "NA", label = "s(test$x)", xt = NULL, id = NULL, sp = NULL)), fake.formula = test$y ~ 1 + test$x, response = "test$y", fake.names = "test$x", pred.names = c("test", "x"), pred.formula = ~test + x), pterms = test$y ~ 1, data = list( `test$y` = c(17.91, 2.58, 4.82, 10.09, 6.24, 10.33, 2.35, 1.94, 3.91), `test$x` = c(69, 365, 452, 100, 120, 120, 150, 159, 180)), knots = 2, sp = NULL, min.sp = NULL, H = NULL, absorb.cons = TRUE, sparse.cons = 0, select = FALSE, idLinksBases = TRUE, scale.penalty = TRUE, paraPen = NULL, drop.intercept = FALSE) 2: do.call(gsname, list(formula = gp, pterms = pterms, data = mf, knots = knots, sp = sp, min.sp = min.sp, H = H, absorb.cons = TRUE, sparse.cons = 0, select = select, idLinksBases = control$idLinksBases, scale.penalty = control$scalePenalty, paraPen = paraPen, drop.intercept = drop.intercept)) 1: gam(test$y ~ s(test$x), k = 2)
And session info
sessionInfo()
R version 3.5.1 (2018-07-02) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.1 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] mgcv_1.8-25 nlme_3.1-137 ggplot2_3.1.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.19 rstudioapi_0.8 bindr_0.1.1 knitr_1.20 magrittr_1.5 tidyselect_0.2.5 munsell_0.5.0 [8] lattice_0.20-38 colorspace_1.3-2 R6_2.3.0 rlang_0.3.0.1 plyr_1.8.4 dplyr_0.7.7 tools_3.5.1 [15] grid_3.5.1 packrat_0.4.9-3 gtable_0.2.0 withr_2.1.2 yaml_2.2.0 lazyeval_0.2.1 assertthat_0.2.0 [22] tibble_1.4.2 crayon_1.3.4 Matrix_1.2-15 bindrcpp_0.2.2 purrr_0.2.5 glue_1.3.0 labeling_0.3 [29] compiler_3.5.1 pillar_1.3.0 scales_1.0.0 pkgconfig_2.0.2
I am wondering if anyone has suggestions about how I might best deal with this error or otherwise get a working gam with this data. Thanks for ideas.
If you use k=3 there's no problem (at least with the method I used which passes the dataframe to the data
argument>
> mod <- gam(y ~ s(x, k=3), data=test)
> mod
Family: gaussian
Link function: identity
Formula:
y ~ s(x, k = 3)
Estimated degrees of freedom:
1.95 total = 2.95
GCV score: 9.922357
> mod <- gam(y ~ s(x, k=4), data=test)
>
> mod
Family: gaussian
Link function: identity
Formula:
y ~ s(x, k = 4)
Estimated degrees of freedom:
2.67 total = 3.67
GCV score: 7.77551
> mod <- gam(y ~ s(x, k=5), data=test)
> mod
Family: gaussian
Link function: identity
Formula:
y ~ s(x, k = 5)
Estimated degrees of freedom:
2.94 total = 3.94
GCV score: 7.758121
Since the GCV score doesn't increase when increasing from 4 to 5 I'm showing what you get with k=4:
> mod <- gam(y ~ s(x, k=4), data=test)
> png()
> plot(mod)
> points(y~x, data=test)
> dev.off()
RStudioGD
2
I'm pretty sure that the reason for the offset of s(x,2.67)
from the points is failure to include the intercept term, which would be another (unplotted) "degree of freedom". If you want the predictions of the full model (which will go through the data points, rather than just the contribution from the smoothing term, then plot predict(mod, newdata=list( x= seq(min(x),max(x), length=100)))
against that same sequence.