I have fit several models using nls to the same data and am trying to figure out how to use caret to do K-fold cross-validation (e.g., here). This SO question asked a general question about using nls in caret for a single model. However, the answer referred them to this resource, which is beyond my understanding (i.e., how to adapt it for nls), especially for fitting several different nls functions.
Here is example data and examples of the model objects fit with nls:
library("caret")
df <-
structure(list(c = c(123.86, 208.75, 141.5, 230.73, 143.4, 209.31,
161.15, 130.87, 232.05, 121.61, 176.31, 139.01, 131.92, 156.61,
150.05, 121, 134.12, 146.83, 181.39, 115, 147.87, 161.49, 107.65,
115.51, 144.11), q = c(0.028, 0.004, 0.049, 0.001, 0.049, 0.004,
0.016, 0.015, 0.003, 0.026, 0.002, 0.009, 0.148, 0.012, 0.017,
0.086, 0.02, 0.038, 0.003, 0.031, 0.011, 0.032, 0.132, 0.093,
0.026)), row.names = c(NA, -25L), class = c("tbl_df", "tbl",
"data.frame"))
# Model 1
eq1 <- function(q,a,n) (a*q**(-1/n))
eq1_fit <- nls(c ~ eq1(q,a,n), data=df,start=list(a=380, n=5))
# Model 2
eq2 <- function(q,h,g,n,c0) (h*exp(-g*q**(1/n))+c0)
eq2_fit <- nls(c ~ eq2(q,h,g,n=5,c0=6), data=df,start=list(h=100,g=1))
There are several additional models, all fit using nls on the same data, that I would like to run 5-fold or 10-fold CV on, following that example or something similar to what is shown here. A solution using tidymodels and workflowsets would also work. Any help is appreciated!
You could do this with the vfold_cv() function in the rsample package, which is in the tidymodels ecosystem.
df <-
structure(list(c = c(123.86, 208.75, 141.5, 230.73, 143.4, 209.31,
161.15, 130.87, 232.05, 121.61, 176.31, 139.01, 131.92, 156.61,
150.05, 121, 134.12, 146.83, 181.39, 115, 147.87, 161.49, 107.65,
115.51, 144.11), q = c(0.028, 0.004, 0.049, 0.001, 0.049, 0.004,
0.016, 0.015, 0.003, 0.026, 0.002, 0.009, 0.148, 0.012, 0.017,
0.086, 0.02, 0.038, 0.003, 0.031, 0.011, 0.032, 0.132, 0.093,
0.026)), row.names = c(NA, -25L), class = c("tbl_df", "tbl",
"data.frame"))
library(rsample)
library(tidyverse)
## define equations
eq1 <- function(q,a,n) (a*q**(-1/n))
eq2 <- function(q,h,g,n,c0) (h*exp(-g*q**(1/n))+c0)
eq1_fit <- nls(c ~ eq1(q,a,n), data=df,start=list(a=380, n=5))
eq2_fit <- nls(c ~ eq2(q,h,g,n=5,c0=6), data=df,start=list(h=100,g=1))
y <- df$c
## for reference: in-sample residuals and squared errors from the fits
## on the full data (the cross-validated versions are computed inside
## cv_mods() below)
r1 <- (y - predict(eq1_fit, newdata=df))
r2 <- (y - predict(eq2_fit, newdata=df))
e1 <- (y - predict(eq1_fit, newdata=df))^2
e2 <- (y - predict(eq2_fit, newdata=df))^2
## define function that will do the model fitting and assessment
cv_mods <- function(split, ...){
## fit models with the analysis partition
eq1_fit <- nls(c ~ eq1(q,a,n), data=analysis(split),start=list(a=380, n=5))
eq2_fit <- nls(c ~ eq2(q,h,g,n=5,c0=6), data=analysis(split),start=list(h=100,g=1))
## take the dependent variable from the assessment partition
y <- assessment(split)$c
## compute out-of-sample residuals using predictions
## on the assessment partition
e1 <- (y - predict(eq1_fit, newdata=assessment(split)))
e2 <- (y - predict(eq2_fit, newdata=assessment(split)))
## return the cross-validated residuals from both models as
## a data frame.
data.frame(e1 = e1, e2 = e2)
}
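Before mapping cv_mods() over every resample, it can be worth running it on a single split to confirm that the nls() starting values still converge on a smaller analysis partition; this quick check is an addition of mine rather than part of the pipeline below:
## sanity check: run the helper on one split and inspect the residuals
one_split <- vfold_cv(df, v = 10)$splits[[1]]
cv_mods(one_split)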
## estimate the cross-validation
## the vfold_cv function sets up the cross-validation partitions
## I used 10 repeats here for speed, but in the "real world" you
## would probably want lots more
out <- vfold_cv(df,
v=10,
repeats = 10) %>%
## estimate the cv on all of the partitions
mutate(err = map(splits,
cv_mods)) %>%
## unnest the error column to turn it into two
## columns in your data frame
unnest(err)
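If you prefer a more familiar metric than the raw residuals, you can also summarise out as an RMSE per fold. The grouping below assumes the id (repeat) and id2 (fold) columns that vfold_cv() adds when repeats > 1; this is a sketch layered on top of the pipeline above rather than part of it:
## per-fold RMSE for each model: sqrt of the mean squared residual within a fold
out %>%
group_by(id, id2) %>%
summarise(across(e1:e2, ~ sqrt(mean(.x^2))), .groups = "drop")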
## First analysis: Pr(mod1 better than mod2)
out1 <- out %>%
## group by repeat
group_by(id) %>%
## calculate the standard deviation of the errors for each level of id
summarise(across(e1:e2, ~sd(.x))) %>%
## calculate the probability across repeats that model 1 is better
## than model 2
summarise(p_eq1_better = mean(e1<e2))
out1
#> # A tibble: 1 × 1
#> p_eq1_better
#> <dbl>
#> 1 1
## alternatively, follow similar steps to above,
## but get the average cross-validation error
## for each model:
out2 <- out %>% group_by(id) %>%
## calculate the standard deviation of the cross-validated residuals for each repeat
summarise(across(e1:e2, ~sd(.x))) %>%
## calculate average CV error:
summarise(across(e1:e2, ~mean(.x)))
out2
#> # A tibble: 1 × 2
#> e1 e2
#> <dbl> <dbl>
#> 1 19.7 21.2
Created on 2022-04-12 by the reprex package (v2.0.1)
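The question mentions several additional nls models. One way to scale the same idea up, sketched here under the assumption that each model's starting values converge on every analysis partition (this generalisation is mine, not part of the answer above), is to keep the fits in a named list and return one residual column per model:
## generalised helper: one named nls() fit per candidate equation
cv_mods_many <- function(split, ...){
train <- analysis(split)
test <- assessment(split)
fits <- list(
eq1 = nls(c ~ eq1(q, a, n), data = train, start = list(a = 380, n = 5)),
eq2 = nls(c ~ eq2(q, h, g, n = 5, c0 = 6), data = train, start = list(h = 100, g = 1))
## add further nls() calls here, one per model
)
## out-of-sample residuals for every model, one column per list element
as_tibble(map(fits, ~ test$c - predict(.x, newdata = test)))
}
## then use cv_mods_many in place of cv_mods in the mutate(err = map(splits, ...)) step above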