I'm fitting some GAMs with mgcv where I need a high k
value to capture complicated behavior in some cases. But I've noticed that sometimes (some datasets), fitting the GAM takes forever when k
is high, and it seems like in these cases, I also don't get convergence. I need to try fitting lots of datasets and can't wait three days when I stumble on one of these datasets where it can take forever and ultimately fail!
I encountered R.utils::withTimeout()
, which seemed like a promising tool to ensure I move on if I hit one of these gam time traps, but it's acting inconsistently for me. Below is a printout of running the same script three times consecutively. Note that the first of these three times, the timeout apparently did not happen. I vaguely understand that there are situations where withTimeout
is expected to fail... I'd like to know how I can enforce a hard stop on mgcv::gam
, potentially with a different tool (I saw a reference to the package processx).
This looks like a lot but it's exact repetitions; session info below as well. Key point from the code below is that withTimeout
apparently did nothing the first time (took 30 seconds) and then kicks in each successive time (8 seconds seems close enough to 6).
Thanks!
> tictoc::tic()
> set.seed(2) ## simulate some data...
> dat <- mgcv::gamSim(1
+ , n = 5000
+ , dist = "normal"
+ , scale = 2)
Gu & Wahba 4 term additive model
> b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+ , data = dat)
> summary(b)
Family: gaussian
Link function: identity
Formula:
y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3,
k = myk)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.84735 0.02869 273.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x0) 5.057 6.321 87.162 <2e-16 ***
s(x1) 3.739 4.671 813.470 <2e-16 ***
s(x2) 20.716 25.865 354.052 <2e-16 ***
s(x3) 3.298 4.121 1.496 0.197
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.728 Deviance explained = 73%
GCV = 4.1439 Scale est. = 4.1158 n = 5000
> tictoc::toc()
30.418 sec elapsed
> # just did this script twice; this part takes me 30 seconds
>
> tmax <- 6
>
> tictoc::tic()
> gf <- tryCatch({
+ res <- R.utils::withTimeout({
+ set.seed(2) ## simulate some data...
+ dat <- mgcv::gamSim(1
+ , n = 5000
+ , dist = "normal"
+ , scale = 2)
+ b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+ , data = dat)
+ summary(b)
+ }, timeout = tmax)
+ }, TimeoutException = function(ex) {
+ message(paste("Timeout before gam fit complete (should take", tmax, "seconds)"))
+
+ })
Gu & Wahba 4 term additive model
> tictoc::toc()
30.332 sec elapsed
> myk <- 120
>
> tictoc::tic()
> set.seed(2) ## simulate some data...
> dat <- mgcv::gamSim(1
+ , n = 5000
+ , dist = "normal"
+ , scale = 2)
Gu & Wahba 4 term additive model
> b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+ , data = dat)
> summary(b)
Family: gaussian
Link function: identity
Formula:
y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3,
k = myk)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.84735 0.02869 273.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x0) 5.057 6.321 87.162 <2e-16 ***
s(x1) 3.739 4.671 813.470 <2e-16 ***
s(x2) 20.716 25.865 354.052 <2e-16 ***
s(x3) 3.298 4.121 1.496 0.197
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.728 Deviance explained = 73%
GCV = 4.1439 Scale est. = 4.1158 n = 5000
> tictoc::toc()
30.415 sec elapsed
> # just did this script twice; this part takes me 30 seconds
>
> # I expect the next part to take 6 seconds based on tmax.
> # dpeending on n and k, it either does, or it can take WAYYY longer. How can I
> # guarantee 6 seconds.
> tmax <- 6
>
> tictoc::tic()
> gf <- tryCatch({
+ res <- R.utils::withTimeout({
+ set.seed(2) ## simulate some data...
+ dat <- mgcv::gamSim(1
+ , n = 5000
+ , dist = "normal"
+ , scale = 2)
+ b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+ , data = dat)
+ summary(b)
+ }, timeout = tmax)
+ }, TimeoutException = function(ex) {
+ message(paste("Timeout before gam fit complete (should take", tmax, "seconds)"))
+
+ })
Gu & Wahba 4 term additive model
Timeout before gam fit complete (should take 6 seconds)
> tictoc::toc()
8.112 sec elapsed
> # Test timeout with mgcv::gam
> myk <- 120
>
> tictoc::tic()
> set.seed(2) ## simulate some data...
> dat <- mgcv::gamSim(1
+ , n = 5000
+ , dist = "normal"
+ , scale = 2)
Gu & Wahba 4 term additive model
> b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+ , data = dat)
> summary(b)
Family: gaussian
Link function: identity
Formula:
y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3,
k = myk)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.84735 0.02869 273.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x0) 5.057 6.321 87.162 <2e-16 ***
s(x1) 3.739 4.671 813.470 <2e-16 ***
s(x2) 20.716 25.865 354.052 <2e-16 ***
s(x3) 3.298 4.121 1.496 0.197
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.728 Deviance explained = 73%
GCV = 4.1439 Scale est. = 4.1158 n = 5000
> tictoc::toc()
30.318 sec elapsed
> # just did this script twice; this part takes me 30 seconds
>
> # I expect the next part to take 6 seconds based on tmax.
> # dpeending on n and k, it either does, or it can take WAYYY longer. How can I
> # guarantee 6 seconds.
> tmax <- 6
>
> tictoc::tic()
> gf <- tryCatch({
+ res <- R.utils::withTimeout({
+ set.seed(2) ## simulate some data...
+ dat <- mgcv::gamSim(1
+ , n = 5000
+ , dist = "normal"
+ , scale = 2)
+ b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+ , data = dat)
+ summary(b)
+ }, timeout = tmax)
+ }, TimeoutException = function(ex) {
+ message(paste("Timeout before gam fit complete (should take", tmax, "seconds)"))
+
+ })
Gu & Wahba 4 term additive model
Timeout before gam fit complete (should take 6 seconds)
> tictoc::toc()
8.04 sec elapsed
> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_1.0.10 compiler_4.2.3 pillar_1.8.1 R.methodsS3_1.8.2 R.utils_2.12.2 tools_4.2.3 mvnfast_0.2.8 digest_0.6.31 evaluate_0.20 lubridate_1.9.2
[11] lifecycle_1.0.3 tibble_3.2.1 nlme_3.1-162 gtable_0.3.2 lattice_0.20-45 timechange_0.2.0 mgcv_1.8-42 pkgconfig_2.0.3 rlang_1.1.0 Matrix_1.5-3
[21] cli_3.6.0 rstudioapi_0.14 patchwork_1.1.2.9000 yaml_2.3.7 parallel_4.2.3 xfun_0.37 fastmap_1.1.1 gratia_0.8.1 knitr_1.42 stringr_1.5.0
[31] furrr_0.3.1 dplyr_1.1.0 generics_0.1.3 vctrs_0.6.0 globals_0.16.2 tictoc_1.1 grid_4.2.3 tidyselect_1.2.0 glue_1.6.2 listenv_0.9.0
[41] R6_2.5.1 fansi_1.0.4 parallelly_1.34.0 rmarkdown_2.20 tidyr_1.3.0 purrr_1.0.1 ggplot2_3.4.1 magrittr_2.0.3 htmltools_0.5.4 scales_1.2.1
[51] codetools_0.2-19 splines_4.2.3 colorspace_2.1-0 future_1.32.0 utf8_1.2.3 stringi_1.7.12 munsell_0.5.0 R.oo_1.25.0
If the R process hangs in C code that doesn't check for interrupts, then you might be out of luck with setTimeLimit
(or utilities based on that). setSessionTimeLimit
also relies on checks for interrupts, so setting a timeout that way, on the R process itself or on an R subprocess, could likewise fail.
In that case, you can set a system-level timeout on the R (sub)process using, e.g., the timeout
command on Linux. As an example:
/* startInfiniteLoop.c */
#include <Rinternals.h>
SEXP startInfiniteLoop(void) {
while (1) ;
return R_NilValue;
}
## startInfiniteLoop.R
tools::Rcmd(c("SHLIB", "startInfiniteLoop.c"))
dyn.load("startInfiniteLoop.so")
.Call("startInfiniteLoop")
This R process hangs and you need to manually kill it:
$ R -f startInfiniteLoop.R
This one also hangs but is killed automatically after 20 seconds:
$ timeout 20 R -f startInfiniteLoop.R
You can invoke timeout
from an R process using system2
, so that timeouts on R subprocesses are specified entirely in your R script, if that's what you prefer:
system2("timeout", c("20", "R", "-f", "startInfiniteLoop.R"))
Obviously, your use case is more complicated. You probably need to pass variables to and from the subprocess, and so on, but maybe you can fill in those details yourself.