What is the difference among eval(parse()), eval(str2lang()), eval(str2expression()), and eval(call()[[1]])?

I'm writing a function to extract a variable provided as a string from either a given data.frame df or the environment env. Initially, I had been using the eval(parse(text=s), df, env) construction to do this, but I learned that there are more efficient alternatives. Other options include:

eval(str2lang(s), df, env)
eval(str2expression(s), df, env)
eval(call(s)[[1]], df, env)

There may be a get solution as well, but I don't know if it can check to see if the variable is in df first before turning to env if it isn't.

Using microbenchmark, it seems that call is the fastest:

library(microbenchmark)

x1 = 1
df = data.frame(x2 = 2)

microbenchmark(call = eval(call('x1')[[1]], df), 
               parse = eval(parse(text='x1'), df), 
               str2lang = eval(str2lang('x1'), df), 
               str2exp = eval(str2expression('x1'), df), 
               check = "identical")
#> Unit: microseconds
#>      expr    min      lq     mean  median      uq     max neval cld
#>      call  1.128  1.2115  1.60815  1.4585  1.6360   4.659   100  a 
#>     parse 39.183 39.8705 46.60755 40.2405 42.0415 135.462   100   b
#>  str2lang  2.235  2.3570  3.26144  2.5995  2.8925  24.641   100  a 
#>   str2exp  2.230  2.3200  2.81387  2.4780  2.6970  10.312   100  a

microbenchmark(call = eval(call('x2')[[1]], df), 
               parse = eval(parse(text='x2'), df), 
               str2lang = eval(str2lang('x2'), df), 
               str2exp = eval(str2expression('x2'), df), 
               check = "identical")
#> Unit: microseconds
#>      expr    min     lq     mean  median      uq     max neval cld
#>      call  1.124  1.194  1.47770  1.3675  1.5795   9.031   100  a 
#>     parse 38.254 38.762 40.21497 38.9630 39.3120 116.510   100   b
#>  str2lang  2.214  2.304  2.55036  2.3960  2.6530  10.639   100  a 
#>   str2exp  2.238  2.331  2.50011  2.4210  2.6515   3.619   100  a

Created on 2020-04-23 by the reprex package (v0.3.0)

I'm therefore inclined to use call but I want to make sure there wouldn't be any unintended consequences of doing so rather than using the other solutions. In other words, in what situations (within the context I'm using them in) would the four methods not give the same answer, leading one to favor one over the others?

Solution

Both call and as.symbol are fine, but I think as.symbol is preferable. as.symbol is unambiguous R terminology (as.name was the S terminology) and is as fast as call. A call can contain any R object, including symbols and other calls, and its documentation is scattered across several help pages. Because the strings could contain punctuation, we cannot be sure at which point str2lang or str2expression will fail, or what error message might pop up.

Checking if variables exist in an environment is safer than doing so for data.frames.
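As a minimal sketch of the lookup described in the question (data.frame first, then the environment), the as.symbol approach could be wrapped as follows. The helper names lookup_var and lookup_get are hypothetical and only serve as illustration; lookup_get shows one possible get-based variant with an explicit check of the column names.

```r
# Hypothetical helper: evaluate the symbol named by `s` in `df`,
# falling back to `env` when `df` has no such column.
lookup_var <- function(s, df, env = parent.frame()) {
  stopifnot(is.character(s), length(s) == 1L, !is.na(s))
  eval(as.symbol(s), df, env)
}

# Hypothetical get()-based variant: check the column names explicitly,
# then fall back to a regular lookup in `env` (and its parents).
lookup_get <- function(s, df, env = parent.frame()) {
  if (s %in% names(df)) df[[s]] else get(s, envir = env)
}

x1 <- 1
df <- data.frame(x2 = 2)

r1 <- lookup_var("x1", df)  # not a column, so taken from the calling environment
r2 <- lookup_var("x2", df)  # taken from df
r3 <- lookup_get("x1", df)
r4 <- lookup_get("x2", df)
```

Both helpers signal an error when the name is found in neither place, so a caller that wants a default value would have to wrap them in tryCatch().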

Differences

A look under the hood reveals that the main differences are in their classes:

l = list(w = as.symbol("x"), x = str2lang("x"), y = str2expression("x"), z = call("x"))
rbind(sapply(l, class), sapply(l, typeof), sapply(l, is.language))
     w        x        y            z         
[1,] "name"   "name"   "expression" "call"    
[2,] "symbol" "symbol" "expression" "language"
[3,] "TRUE"   "TRUE"   "TRUE"       "TRUE"
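The table above also hints at why call(s)[[1]] works at all. The following small sketch (the object names cl and msg are just illustrative) shows that call("x") builds the function call x(), and that its first element is the same symbol that as.symbol("x") returns:

```r
cl <- call("x")

deparse(cl)                         # "x()"
length(cl)                          # 1: only the function name, no arguments
identical(cl[[1]], as.symbol("x"))  # TRUE

# Evaluating the call itself tries to call a function named x, which is
# not what we want here; the error message is captured instead of thrown.
msg <- tryCatch(eval(cl, list(x = 1)), error = function(err) conditionMessage(err))
```

So call(s)[[1]] is a detour that ends at the same symbol, which is one more reason to use as.symbol directly.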

A good starting point is the R Language Definition and, for examples, the Expressions chapter in Hadley Wickham's Advanced R (source), followed by ?call, ?parse, and ?rlang::call2.

What are they?

In Evaluating the Design of the R Language (source) the following definition is given:

Arguments of calls are expressions which may be named by a symbol ~~name~~.

Expression - an action or actions.
Call - represents the action of calling a function.
Symbols (Name) - refer to R objects (2.1.3)

Which one to use when?

From the ?call documentation:

Instead of as.call(<string>), consider using str2lang(*) which is an efficient version of parse(text=*). call() and as.call(), are much preferable to parse() based approaches.

Using pryr::show_c_source(.Internal(str2lang(s))) we can see that str2lang and str2expression both call the C function do_str2lang but differ in their argument. This difference can be found in ?parse:

str2expression("x"), equal to parse(text = "x", keep.source = FALSE), is always an expression.
str2lang("x"), equal to parse(text = "x", keep.source = FALSE)[[1]], can evaluate to a call or to a symbol, NULL, numeric, integer, or logical, i.e. a call or simpler.
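A quick way to see this difference is to compare the classes of what the two functions return for a few inputs. This is only an illustrative sketch; the variable names are arbitrary.

```r
cls_expr  <- class(str2expression("x + 1"))  # "expression"
cls_call  <- class(str2lang("x + 1"))        # "call"
cls_sym   <- class(str2lang("x"))            # "name"
cls_num   <- class(str2lang("1"))            # "numeric"
cls_int   <- class(str2lang("1L"))           # "integer"
cls_lgl   <- class(str2lang("TRUE"))         # "logical"
is_null   <- is.null(str2lang("NULL"))       # TRUE

# str2lang() returns the single element of the parsed expression.
same_as_parse <- identical(str2lang("x"),
                           parse(text = "x", keep.source = FALSE)[[1]])  # TRUE
```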

The R Language Definition mentions that expressions are only evaluated when passed to eval, whereas other language objects may get evaluated in some unexpected cases. I could not find out what these cases are.

We can throw some edge cases at all approaches to see when which might fail:

"\"" a wrongly parsed quote when reading in data?
" " an empty variable? A space in a variable name?
"_" illegal token (legacy assignment operator) issues?
"`" a backtick
NA_character_ a missing value

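# If it breaks, show how it breaks. `x` is only forced inside tryCatch(),
# so parse errors from str2lang()/str2expression() are caught as well.
# ..1 is the data.frame and ..2 the enclosing environment.
do <- function(x, ...) tryCatch(eval(x, ..1, ..2), error = function(err) err$message)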
check <- function(x, ...){
  list(
    do(call(x)[[1]], ...),
    do(as.symbol(x), ...),
    do(str2lang(x), ...),
    do(str2expression(x), ...)
  )
}

# test 1: some variables do not exist
e <- new.env() ; e$x1 <- 5
df = data.frame(x2 = 3)
no_var <- lapply(list("x1", "x2", "\"", " ", "_", "`", NA_character_), check, df, e)

# test 2: some variables exist
e <- new.env() ; e$x1 = 5; e$`_` = 2 ; e$"\"" = 5; e$`NA` <- 5
df = data.frame(x2 = 3, " " = 7, "_" = 2)
var <- lapply(list("x1", "x2", "\"", " ", "_", "`", NA_character_), check, df, e)

# difference in outcomes in
var[3:7]
no_var[3:7]

When variables do not exist: as.symbol and call always reach eval, whereas str2lang and str2expression fail early, at the parsing step, in all cases. str2expression differs from str2lang when a blank string (" ") is given.

When variables exist: as.symbol and call succeed in cases 3, 4, and 7 ("\"", " ", and NA_character_), while str2lang and str2expression throw errors.

On a side note, be careful when storing such variables in data.frames: by default, data.frame() rewrites non-syntactic column names (check.names = TRUE), so they are easy to corrupt.

names(data.frame(" " = 1, "_" = 2, "\"" = 3, "`" = 4))
[1] "X."   "X_"   "X..1" "X..2"
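If the odd names really have to survive, one option is to switch the name check off. The sketch below assumes that keeping non-syntactic column names is acceptable for the rest of the code; df_raw is just an illustrative name.

```r
# check.names = FALSE keeps the column names exactly as given
df_raw <- data.frame(" " = 1, "_" = 2, check.names = FALSE)

nms         <- names(df_raw)                 # " " "_"
val_space   <- eval(as.symbol(" "), df_raw)  # 1
val_unders  <- eval(as.symbol("_"), df_raw)  # 2
```

With the names preserved, the as.symbol lookup finds the columns directly, while the parse-based approaches would still fail on these strings.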