How do estimation commands find variable names in formulas in R?

I'd like to estimate a large number of models using R's nls() function on a user-defined function. Since many variables are fixed across my specifications, I'd like a way of pre-setting them in my function, but I don't properly understand how R looks for variables in functions contained in a formula.

I've had a look at the section on metaprogramming in Hadley Wickham's advanced R book, but it hasn't enlightened me. Here is a simplified example of what I'm trying to acheive, using the mtcars dataset:

I have tried setting a default value for variables that are fixed across specificaitons:

expo <- function(x, theta, weight = wt) {
  x*weight^theta
}

I have also tried just using the column name of the fixed variable as a variable inside the function

expo <- function(x, theta) {
  x*wt^theta
}

Both these approaches work if I just want to calculate the function, say with

attach(mtcars)
expo(qsec, 1)
detach()

But if I try using my expo() function in an estimation routine, for example

nls(mpg ~ phi + expo(qsec, theta),
    data = mtcars,
    start = c('phi' = -2, 'theta' = 1))

It fails with the message Error in expo(qsec, theta) : object 'wt' not found. One possibility, brought up in the comments, is to simply pass the dataset, mtcars in this case, to expo() as an argument. But since I will only call expo() inside of a call to nls() where the dataset is already an argument, I would be happy if I could find a way to avoid this repetition.

My ultimate goal after defining or calling expo() appropriately is to be able to do something like this:

depvars <- c('qsec', 'drat', 'dist')
lapply <- (depvars, function(x) {
    formula <- as.formula(paste0('mpg ~ phi + expo(', x, ', theta)'))
    nls(formula,
        data = mtcars,
        start = c('phi' = -2, 'theta' = 1))
}

Solution

The tricky thing is that R's lexical scoping searches in enclosing environments, which can be confusing during calls because the caller environments can each have enclosing environments and things get confusing pretty quickly.

I'll be using the rlang package to debug this scenario.

First, if you defined expo in the global environment, then that will be its enclosing environment:

expo <- function(x, theta) {
  x*wt^theta
}

rlang::get_env(expo)
# <environment: R_GlobalEnv>

So when you call it, R will first search for variables in the function's call (not caller!) environment, and then in the enclosing environment (global environment here).

I don't know what nls does exactly, but I would have assumed that it creates an environment from the data you provide and evaluates the formula there. However, it seems the environment it creates only contains the variables it can explicitly see in the formula, something I found with:

expo <- function(x, theta) {
  cat("caller: ")
  print(ls(rlang::caller_env()))
  cat("enclosing: ")
  print(ls(rlang::env_parent(rlang::current_env())))
}

nls(mpg ~ phi + expo(qsec, theta),
    data = mtcars,
    start = c('phi' = -2, 'theta' = 1))
# caller: [1] "mpg"   "phi"   "qsec"  "theta"
# enclosing: [1] "expo"    
# Error ...

As we can see, the caller environment of expo contains the variables we can identify in the formula, and its enclosing environment only contains the definition of expo (the global environment). This unfortunately means that you can't even use something like eval.parent inside expo, because that environment won't have all variables from data.

If you still want to work around it, you could modify expo's enclosing environment with your data before calling nls, something like:

expo <- function(x, theta) {
  x*wt^theta
}

environment(expo) <- list2env(as.list(mtcars))

nls(mpg ~ phi + expo(qsec, theta),
    data = mtcars,
    start = c('phi' = -2, 'theta' = 1))
# Error ... number of iterations exceeded maximum of 50