Search code examples
rgam

Gam is R returning either "degrees of freedom" or "subscript out of bounds" error


I am trying to run a gam model on a small data set

library(ggplot2)
library(mgcv)

test <- structure(list(x = c(69, 365, 452, 100, 120, 120, 150, 159, 180
), y = c(17.91, 2.58, 4.82, 10.09, 6.24, 10.33, 2.35, 1.94, 3.91
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-9L))

This data looks like this

ggplot(test, aes(x = x, y = y)) + geom_point()

enter image description here

I'd like to try fitting this data with a spline with gam. I tried the following, which returns an error

mod <- gam(test$y ~ s(test$x))
 Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
 A term has fewer unique covariate combinations than specified maximum
 degrees of freedom

Based on this post http://r.789695.n4.nabble.com/Help-with-GAM-mgcv-td3074165.html it seems like maybe I need to mandate fewer knots.

I tried telling the model to only use two knots as follows (figuring I'd figure out the optimum number somehow later):

mod <- gam(test$y ~ s(test$x), k = 2)
 Error in data[[txt]] : subscript out of bounds

I'm not sure what this later error means, or why I am getting it.

Just in case it is of any use, here is the traceback for that error

 traceback()
7: get.var(object$term[i], knots)
6: ExtractData(object, data, knots)
5: smooth.construct3(object, data, knots)
4: smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons, scale.penalty = scale.penalty, 
       null.space.penalty = select, sparse.cons = sparse.cons, diagonal.penalty = diagonal.penalty, 
       apply.by = apply.by, modCon = modCon)
3: gam.setup(formula = list(pf = test$y ~ 1, pfok = 1, smooth.spec = list(
       list(term = "test$x", bs.dim = -1, fixed = FALSE, dim = 1L, 
           p.order = NA, by = "NA", label = "s(test$x)", xt = NULL, 
           id = NULL, sp = NULL)), fake.formula = test$y ~ 1 + test$x, 
       response = "test$y", fake.names = "test$x", pred.names = c("test", 
       "x"), pred.formula = ~test + x), pterms = test$y ~ 1, data = list(
       `test$y` = c(17.91, 2.58, 4.82, 10.09, 6.24, 10.33, 2.35, 
       1.94, 3.91), `test$x` = c(69, 365, 452, 100, 120, 120, 150, 
       159, 180)), knots = 2, sp = NULL, min.sp = NULL, H = NULL, 
       absorb.cons = TRUE, sparse.cons = 0, select = FALSE, idLinksBases = TRUE, 
       scale.penalty = TRUE, paraPen = NULL, drop.intercept = FALSE)
2: do.call(gsname, list(formula = gp, pterms = pterms, data = mf, 
       knots = knots, sp = sp, min.sp = min.sp, H = H, absorb.cons = TRUE, 
       sparse.cons = 0, select = select, idLinksBases = control$idLinksBases, 
       scale.penalty = control$scalePenalty, paraPen = paraPen, 
       drop.intercept = drop.intercept))
1: gam(test$y ~ s(test$x), k = 2)

And session info

 sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mgcv_1.8-25   nlme_3.1-137  ggplot2_3.1.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.19     rstudioapi_0.8   bindr_0.1.1      knitr_1.20       magrittr_1.5     tidyselect_0.2.5 munsell_0.5.0   
 [8] lattice_0.20-38  colorspace_1.3-2 R6_2.3.0         rlang_0.3.0.1    plyr_1.8.4       dplyr_0.7.7      tools_3.5.1     
[15] grid_3.5.1       packrat_0.4.9-3  gtable_0.2.0     withr_2.1.2      yaml_2.2.0       lazyeval_0.2.1   assertthat_0.2.0
[22] tibble_1.4.2     crayon_1.3.4     Matrix_1.2-15    bindrcpp_0.2.2   purrr_0.2.5      glue_1.3.0       labeling_0.3    
[29] compiler_3.5.1   pillar_1.3.0     scales_1.0.0     pkgconfig_2.0.2

I am wondering if anyone has suggestions about how I might best deal with this error or otherwise get a working gam with this data. Thanks for ideas.


Solution

  • If you use k=3 there's no problem (at least with the method I used which passes the dataframe to the data argument>

    > mod <- gam(y ~ s(x, k=3), data=test)
    > mod
    
    Family: gaussian 
    Link function: identity 
    
    Formula:
    y ~ s(x, k = 3)
    
    Estimated degrees of freedom:
    1.95  total = 2.95 
    
    GCV score: 9.922357     
    > mod <- gam(y ~ s(x, k=4), data=test)
    > 
    > mod
    
    Family: gaussian 
    Link function: identity 
    
    Formula:
    y ~ s(x, k = 4)
    
    Estimated degrees of freedom:
    2.67  total = 3.67 
    
    GCV score: 7.77551     
    > mod <- gam(y ~ s(x, k=5), data=test)
    > mod
    
    Family: gaussian 
    Link function: identity 
    
    Formula:
    y ~ s(x, k = 5)
    
    Estimated degrees of freedom:
    2.94  total = 3.94 
    
    GCV score: 7.758121 
    

    Since the GCV score doesn't increase when increasing from 4 to 5 I'm showing what you get with k=4:

     > mod <- gam(y ~ s(x, k=4), data=test)
    > png()
    > plot(mod)
    > points(y~x, data=test)
    > dev.off()
    RStudioGD 
            2 
    

    enter image description here

    I'm pretty sure that the reason for the offset of s(x,2.67) from the points is failure to include the intercept term, which would be another (unplotted) "degree of freedom". If you want the predictions of the full model (which will go through the data points, rather than just the contribution from the smoothing term, then plot predict(mod, newdata=list( x= seq(min(x),max(x), length=100))) against that same sequence.