Search code examples
rdataframenestedtidyrnls

R - Using nested dataframe to run function with different sets of parameters


I would like to create a wrapper for the Levenberg-Marquardt Nonlinear Least-Squares function nls.lm (minpack.lm library) similar to nls2 (nls2 library) to give a brute force method for evaluating the fit of a model to observed data.

The idea is to create a range of starting value combinations and either:

  • pass these to a function, then compare the function output to the observed data to create an R^2 value for each of the starting value combinations and run the nls.lm fitting with the best one of them.

or

  • run nls.lm on all combinations and select the best returned fit.

I wanted to do this without looping and after inspiration from here am trying to use nested dataframes, with one column for the parameter input list, one for the values returned by my function, one for the R^2 values, and one for the best fit models,something like:

df
#   start_val fun_out       R^2   
# 1 {a=2,b=2} {22,24,26...} 0.8   
# 2 {a=3,b=5} {35,38,41...} 0.6   

This is the code I have so far:

require(dplyr);require(tidyr)

foo <- function(x,a,b) a*x^2+b # function I am fitting
x <- 1:10 # independent variable
y_obs <- foo(x,1.5,2.5) + rnorm(length(x),0,10) # observed data (dependent variable)

start_range <- data.frame(a=c(1,2),b=c(2,3)) # range of allowed starting points for fitting
reps <- 2 # number of starting points to generate

# Create a data frame of starting points
df<-as.data.frame(sapply(start_range, function(x) runif(reps,min=x[[1]],max=x[[2]]))) %>%
  mutate(id=seq_len(reps)) %>% # fudge to make nest behave as I want
  nest(1:ncol(start_range)) %>%
  mutate(data=as.list(data)) %>%
  as.data.frame()

df
#   id               data
# 1  1 1.316356, 2.662923
# 2  2 1.059356, 2.723081

I get stuck now trying to pass the parameters in data into the function foo(). I've tried using do.call(), and even with using constant parameters the following error appears:

mutate(df,y=do.call(foo,list(x,1,2)))
# Error: wrong result size (5), expected 2 or 1

Is there a way to create columns of a dataframe which contain lists directly without using nest()?

Also when trying to create the list to pass to do.call() using the dataframe columns, how do you create a list where the first element is the vector x, the second is the parameter a and the third is the parameter b? The follwing splits the list down the column:

mutate(df,my_list=list(x,data))
#   id               data                                my_list
# 1  1 1.316356, 2.662923          1, 2, 3, 4, 5, 6, 7, 8, 9, 10
# 2  2 1.059356, 2.723081 1.316356, 2.662923, 1.059356, 2.723081

Solution

  • Running nls2 using algorithm = "random-search" and all = TRUE and the specified maxiter will evaluate foo at maxiter random points and return starting_fits which are the fits at those points. It consists of a set of "nls" class objects evaluated at each of the randomly chosen starting values. It does not do an optimization from each of these starting values but just returns the "nls" object at each. That is, nls is not run. Now for each starting fit run nlsLM giving fits, a list of nlsLM fits and from that summarize them in data (a data frame with one row per run) and show the least.

    If we only want to pick the best starting value and just run nlsLM once from that then use the alternate code near the end.

    library(nls2)
    
    fo <- y_obs ~ foo(x, a, b)
    starting_fits <- nls2(fo, algorithm = "random-search", 
     start = start_range, control = nls.control(maxiter = reps), all = TRUE)
    
    fits <- lapply(starting_fits, function(fit) nlsLM(fo, start = coef(fit)))
    
    data <- data.frame(RSS = sapply(fits, deviance), t(sapply(fits, coef)),
       start = t(sapply(starting_fits, coef)))
    # data$fits <- fits   # optional to store each row's fitted object in that row
    subset(data, RSS == min(RSS))   # minimum(s)
    

    giving:

           RSS        a        b  start.a  start.b
    2 706.3956 1.396616 7.226525 1.681819 2.768374
    

    R squared is used for linear regression. It is not valid for nonlinear regression. Residual sum of squares (RSS) is shown above instead.

    Alternately if you just want to pick out the best starting value and run nlsLM on that then just omit the all=TRUE argument from the nls2 call and do this. If you need the coefficients and RSS for later code then try coef(fit) and deviance(fit) .

    starting_fit <- nls2(fo, algorithm = "random-search", 
     start = start_range, control = nls.control(maxiter = reps))
    
    fit <- nlsLM(fo, start = coef(starting_fit))
    

    Note 1: If you are getting errors from nlsLM try replacing nlsLM(...) with try(nlsLM(...)). This will issue error messages (use try(..., silent = TRUE) if you don't want them) but will not stop processing.

    Note 2: I assume that the foo shown in the question is just an example and the real function is more complex. The foo shown is linear in the coefficients so one could use lm for it. No nonlinear optimization is needed.