Search code examples
rtimeouttry-catchmgcv

guarantee timeout with `mgcv::gam`


I'm fitting some GAMs with mgcv where I need a high k value to capture complicated behavior in some cases. But I've noticed that sometimes (some datasets), fitting the GAM takes forever when k is high, and it seems like in these cases, I also don't get convergence. I need to try fitting lots of datasets and can't wait three days when I stumble on one of these datasets where it can take forever and ultimately fail!

I encountered R.utils::withTimeout(), which seemed like a promising tool to ensure I move on if I hit one of these gam time traps, but it's acting inconsistently for me. Below is a printout of running the same script three times consecutively. Note that the first of these three times, the timeout apparently did not happen. I vaguely understand that there are situations where withTimeout is expected to fail... I'd like to know how I can enforce a hard stop on mgcv::gam, potentially with a different tool (I saw a reference to the package processx).

This looks like a lot but it's exact repetitions; session info below as well. Key point from the code below is that withTimeout apparently did nothing the first time (took 30 seconds) and then kicks in each successive time (8 seconds seems close enough to 6).

Thanks!

> tictoc::tic()
> set.seed(2) ## simulate some data... 
> dat <- mgcv::gamSim(1
+               , n = 5000
+               , dist = "normal"
+               , scale = 2)
Gu & Wahba 4 term additive model
> b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+          , data = dat)
> summary(b)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, 
    k = myk)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.84735    0.02869   273.5   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
         edf Ref.df       F p-value    
s(x0)  5.057  6.321  87.162  <2e-16 ***
s(x1)  3.739  4.671 813.470  <2e-16 ***
s(x2) 20.716 25.865 354.052  <2e-16 ***
s(x3)  3.298  4.121   1.496   0.197    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.728   Deviance explained =   73%
GCV = 4.1439  Scale est. = 4.1158    n = 5000
> tictoc::toc()
30.418 sec elapsed
>  # just did this script twice; this part takes me 30 seconds
> 
> tmax <- 6
> 
> tictoc::tic()
> gf <-  tryCatch({
+   res <- R.utils::withTimeout({
+     set.seed(2) ## simulate some data... 
+     dat <- mgcv::gamSim(1
+                   , n = 5000
+                   , dist = "normal"
+                   , scale = 2)
+     b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+              , data = dat)
+     summary(b)
+   }, timeout = tmax)
+ }, TimeoutException = function(ex) {
+   message(paste("Timeout before gam fit complete (should take", tmax,  "seconds)"))
+   
+ })
Gu & Wahba 4 term additive model
> tictoc::toc()
30.332 sec elapsed
> myk <- 120
> 
> tictoc::tic()
> set.seed(2) ## simulate some data... 
> dat <- mgcv::gamSim(1
+               , n = 5000
+               , dist = "normal"
+               , scale = 2)
Gu & Wahba 4 term additive model
> b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+          , data = dat)
> summary(b)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, 
    k = myk)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.84735    0.02869   273.5   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
         edf Ref.df       F p-value    
s(x0)  5.057  6.321  87.162  <2e-16 ***
s(x1)  3.739  4.671 813.470  <2e-16 ***
s(x2) 20.716 25.865 354.052  <2e-16 ***
s(x3)  3.298  4.121   1.496   0.197    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.728   Deviance explained =   73%
GCV = 4.1439  Scale est. = 4.1158    n = 5000
> tictoc::toc()
30.415 sec elapsed
>  # just did this script twice; this part takes me 30 seconds
> 
> # I expect the next part to take 6 seconds based on tmax.
> # dpeending on n and k, it either does, or it can take WAYYY longer. How can I 
> # guarantee 6 seconds.
> tmax <- 6
> 
> tictoc::tic()
> gf <-  tryCatch({
+   res <- R.utils::withTimeout({
+     set.seed(2) ## simulate some data... 
+     dat <- mgcv::gamSim(1
+                   , n = 5000
+                   , dist = "normal"
+                   , scale = 2)
+     b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+              , data = dat)
+     summary(b)
+   }, timeout = tmax)
+ }, TimeoutException = function(ex) {
+   message(paste("Timeout before gam fit complete (should take", tmax,  "seconds)"))
+   
+ })
Gu & Wahba 4 term additive model
Timeout before gam fit complete (should take 6 seconds)
> tictoc::toc()
8.112 sec elapsed
> # Test timeout with mgcv::gam
> myk <- 120
> 
> tictoc::tic()
> set.seed(2) ## simulate some data... 
> dat <- mgcv::gamSim(1
+               , n = 5000
+               , dist = "normal"
+               , scale = 2)
Gu & Wahba 4 term additive model
> b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+          , data = dat)
> summary(b)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, 
    k = myk)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.84735    0.02869   273.5   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
         edf Ref.df       F p-value    
s(x0)  5.057  6.321  87.162  <2e-16 ***
s(x1)  3.739  4.671 813.470  <2e-16 ***
s(x2) 20.716 25.865 354.052  <2e-16 ***
s(x3)  3.298  4.121   1.496   0.197    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.728   Deviance explained =   73%
GCV = 4.1439  Scale est. = 4.1158    n = 5000
> tictoc::toc()
30.318 sec elapsed
>  # just did this script twice; this part takes me 30 seconds
> 
> # I expect the next part to take 6 seconds based on tmax.
> # dpeending on n and k, it either does, or it can take WAYYY longer. How can I 
> # guarantee 6 seconds.
> tmax <- 6
> 
> tictoc::tic()
> gf <-  tryCatch({
+   res <- R.utils::withTimeout({
+     set.seed(2) ## simulate some data... 
+     dat <- mgcv::gamSim(1
+                   , n = 5000
+                   , dist = "normal"
+                   , scale = 2)
+     b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+              , data = dat)
+     summary(b)
+   }, timeout = tmax)
+ }, TimeoutException = function(ex) {
+   message(paste("Timeout before gam fit complete (should take", tmax,  "seconds)"))
+   
+ })
Gu & Wahba 4 term additive model
Timeout before gam fit complete (should take 6 seconds)
> tictoc::toc()
8.04 sec elapsed
> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10          compiler_4.2.3       pillar_1.8.1         R.methodsS3_1.8.2    R.utils_2.12.2       tools_4.2.3          mvnfast_0.2.8        digest_0.6.31        evaluate_0.20        lubridate_1.9.2     
[11] lifecycle_1.0.3      tibble_3.2.1         nlme_3.1-162         gtable_0.3.2         lattice_0.20-45      timechange_0.2.0     mgcv_1.8-42          pkgconfig_2.0.3      rlang_1.1.0          Matrix_1.5-3        
[21] cli_3.6.0            rstudioapi_0.14      patchwork_1.1.2.9000 yaml_2.3.7           parallel_4.2.3       xfun_0.37            fastmap_1.1.1        gratia_0.8.1         knitr_1.42           stringr_1.5.0       
[31] furrr_0.3.1          dplyr_1.1.0          generics_0.1.3       vctrs_0.6.0          globals_0.16.2       tictoc_1.1           grid_4.2.3           tidyselect_1.2.0     glue_1.6.2           listenv_0.9.0       
[41] R6_2.5.1             fansi_1.0.4          parallelly_1.34.0    rmarkdown_2.20       tidyr_1.3.0          purrr_1.0.1          ggplot2_3.4.1        magrittr_2.0.3       htmltools_0.5.4      scales_1.2.1        
[51] codetools_0.2-19     splines_4.2.3        colorspace_2.1-0     future_1.32.0        utf8_1.2.3           stringi_1.7.12       munsell_0.5.0        R.oo_1.25.0  

Solution

  • If the R process hangs in C code that doesn't check for interrupts, then you might be out of luck with setTimeLimit (or utilities based on that). setSessionTimeLimit also relies on checks for interrupts, so setting a timeout that way, on the R process itself or on an R subprocess, could likewise fail.

    In that case, you can set a system-level timeout on the R (sub)process using, e.g., the timeout command on Linux. As an example:

    /* startInfiniteLoop.c */
    #include <Rinternals.h>
    SEXP startInfiniteLoop(void) {
        while (1) ;
        return R_NilValue;
    }
    
    ## startInfiniteLoop.R
    tools::Rcmd(c("SHLIB", "startInfiniteLoop.c"))
    dyn.load("startInfiniteLoop.so")
    .Call("startInfiniteLoop")
    

    This R process hangs and you need to manually kill it:

    $ R -f startInfiniteLoop.R
    

    This one also hangs but is killed automatically after 20 seconds:

    $ timeout 20 R -f startInfiniteLoop.R
    

    You can invoke timeout from an R process using system2, so that timeouts on R subprocesses are specified entirely in your R script, if that's what you prefer:

    system2("timeout", c("20", "R", "-f", "startInfiniteLoop.R"))
    

    Obviously, your use case is more complicated. You probably need to pass variables to and from the subprocess, and so on, but maybe you can fill in those details yourself.