I'm writing a function to extract a variable provided as a string from either a given data.frame df or the environment env. Initially, I had been using the eval(parse(text=s), df, env) construction to do this, but I learned that there are more efficient alternatives. Other options include:
eval(str2lang(s), df, env)
eval(str2expression(s), df, env)
eval(call(s)[[1]], df, env)
There may be a get solution as well, but I don't know whether it can check if the variable is in df first before turning to env if it isn't.
Using microbenchmark, it seems that call is the fastest:
library(microbenchmark)
x1 = 1
df = data.frame(x2 = 2)
microbenchmark(call = eval(call('x1')[[1]], df),
parse = eval(parse(text='x1'), df),
str2lang = eval(str2lang('x1'), df),
str2exp = eval(str2expression('x1'), df),
check = "identical")
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> call 1.128 1.2115 1.60815 1.4585 1.6360 4.659 100 a
#> parse 39.183 39.8705 46.60755 40.2405 42.0415 135.462 100 b
#> str2lang 2.235 2.3570 3.26144 2.5995 2.8925 24.641 100 a
#> str2exp 2.230 2.3200 2.81387 2.4780 2.6970 10.312 100 a
microbenchmark(call = eval(call('x2')[[1]], df),
parse = eval(parse(text='x2'), df),
str2lang = eval(str2lang('x2'), df),
str2exp = eval(str2expression('x2'), df),
check = "identical")
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> call 1.124 1.194 1.47770 1.3675 1.5795 9.031 100 a
#> parse 38.254 38.762 40.21497 38.9630 39.3120 116.510 100 b
#> str2lang 2.214 2.304 2.55036 2.3960 2.6530 10.639 100 a
#> str2exp 2.238 2.331 2.50011 2.4210 2.6515 3.619 100 a
Created on 2020-04-23 by the reprex package (v0.3.0)
I'm therefore inclined to use call, but I want to make sure there wouldn't be any unintended consequences of doing so rather than using the other solutions. In other words, in what situations (within the context I'm using them in) would the four methods not give the same answer, leading one to favor one over the others?
Both call and as.symbol are fine, but I think as.symbol is preferable. as.symbol is unambiguous R terminology (as.name was S terminology) and is as fast as call. A call can contain any R object, including symbols and other calls, and its documentation is all over the place. Because the strings could contain punctuation, we cannot be sure at which point str2lang or str2expression will fail, or what error message might pop up.
Checking if variables exist in an environment is safer than doing so for data.frames.
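One way to make the "df first, then env" order explicit is to check for the name before fetching it. The following is only a minimal sketch: lookup_var is a hypothetical helper name, not something from base R, and it assumes s is a single non-NA string. It never parses s, so punctuation in the name cannot trigger a parse error.

```r
# Hypothetical helper (name is an assumption): look for `s` in `df` first,
# then in `env` and its enclosing environments.
lookup_var <- function(s, df, env = parent.frame()) {
  stopifnot(is.character(s), length(s) == 1L, !is.na(s))
  if (s %in% names(df)) {
    return(df[[s]])                      # exact column-name match, no parsing
  }
  if (exists(s, envir = env, inherits = TRUE)) {
    return(get(s, envir = env, inherits = TRUE))
  }
  stop("variable '", s, "' not found in df or env")
}

e <- new.env()
e$x1 <- 5
d <- data.frame(x2 = 3)
lookup_var("x1", d, e)   # found in e
lookup_var("x2", d, e)   # found in d
```

For plain variable names this should behave much like eval(as.symbol(s), df, env), with the difference that a missing variable produces an error message of your own choosing.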
A look under the hood reveals that the main differences are in their classes:
l = list(w = as.symbol("x"), x = str2lang("x"), y = str2expression("x"), z = call("x"))
rbind(sapply(l, class), sapply(l, typeof), sapply(l, is.language))
w x y z
[1,] "name" "name" "expression" "call"
[2,] "symbol" "symbol" "expression" "language"
[3,] "TRUE" "TRUE" "TRUE" "TRUE"
A good starting point is the R Language Definition, and for examples the Expressions chapter in Hadley Wickham's Advanced R (source), followed by reading ?call, ?parse, rlang::call2.
In Evaluating the Design of the R Language (source) the following definition is given:
Arguments of calls are expressions which may be named by a symbol name.
From the ?call documentation:
Instead of as.call(<string>), consider using str2lang(*) which is an efficient version of parse(text=*). call() and as.call(), are much preferable to parse() based approaches.
Using pryr::show_c_source(.Internal(str2lang(s))) we can see that str2lang and str2expression both call the C function do_str2lang but differ in their argument. This difference can be found in ?parse:
str2expression, equal to parse(text = "x", keep.source = FALSE), is always an expression.
str2lang("x"), equal to parse(text = "x", keep.source = FALSE)[[1]], can evaluate to a call OR a symbol, NULL, numeric, integer or logical, i.e. a call or simpler.
In the R Language Definition it is mentioned that expressions are only evaluated when passed to eval, whereas other language objects may get evaluated in some unexpected cases. What these cases are, I could not find.
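The "a call or simpler" description can be checked directly. This is a small illustrative sketch (it assumes R >= 3.6.0, where str2lang and str2expression are available); the example strings are my own.

```r
# What str2lang() returns for a few one-element strings
class(str2lang("x"))        # a symbol (class "name")
class(str2lang("x + 1"))    # a call
class(str2lang("1"))        # a numeric constant
class(str2lang("TRUE"))     # a logical constant
is.null(str2lang("NULL"))   # NULL itself

# str2expression() always wraps the result in an expression vector
ex <- str2expression("x")
is.expression(ex)
identical(ex[[1]], str2lang("x"))  # its first element is the same symbol
```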
We can throw some edge cases at all approaches to see when which might fail:
"\"": wrongly parsed quote when reading in data?
" ": empty variable? space in variable name?
"_": illegal token (legacy assignment operator) issues?
# if it breaks, show how it breaks
do <- function(x, ...) tryCatch(eval(x, ..1, ..2), error = function(t) t$message)
check <- function(x, ...){
list(
do(call(x)[[1]], ...),
do(as.symbol(x), ...),
do(str2lang(x), ...),
do(str2expression(x), ...)
)
}
# test 1: some variables do not exist
e <- new.env() ; e$x1 <- 5
df = data.frame(x2 = 3)
no_var <- lapply(list("x1", "x2", "\"", " ", "_", "`", NA_character_), check, df, e)
# test 2: some variables exist
e <- new.env() ; e$x1 = 5; e$`_` = 2 ; e$"\"" = 5; e$`NA` <- 5
df = data.frame(x2 = 3, " " = 7, "_" = 2)
var <- lapply(list("x1", "x2", "\"", " ", "_", "`", NA_character_), check, df, e)
# difference in outcomes in
var[3:7]
no_var[3:7]
When variables do not exist: as.symbol and call always reach eval, whereas str2lang and str2expression fail early in all cases. str2expression differs from str2lang when an empty string is given.
When variables exist: as.symbol and call succeed in cases 3, 4, 7 while str2lang and str2expression throw errors.
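The cases listed here are about strings that fail to parse. A related situation, which I add only as an illustration, is a string that does parse but as an expression rather than a name. No approach errors in that case, but they return different values. The column name "x2 + 1" below is a made-up example.

```r
d <- data.frame(x2 = 3)
d[["x2 + 1"]] <- 10          # a column whose name looks like an expression

s <- "x2 + 1"
eval(as.symbol(s), d)        # looks up the column literally named "x2 + 1"
eval(call(s)[[1]], d)        # same: the call's function slot is that symbol
eval(str2lang(s), d)         # parses the string and computes x2 + 1
eval(str2expression(s), d)   # likewise computes x2 + 1
```

So as.symbol and call treat the string strictly as a name, while the parse-based functions treat it as code. Which is right depends on whether the caller is expected to pass names only.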
As a side note, be careful when storing variables in data.frames: with the default check.names = TRUE, non-syntactic names are silently altered.
names(data.frame(" " = 1, "_" = 2, "\"" = 3, "`" = 4))
[1] "X." "X_" "X..1" "X..2"
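If the original names need to survive, data.frame() accepts check.names = FALSE. A short sketch with two of the names used in the tests:

```r
# check.names = FALSE keeps non-syntactic column names as given
d <- data.frame(" " = 1, "_" = 2, check.names = FALSE)
names(d)

# as.symbol() can then find these columns without any parsing
eval(as.symbol("_"), d)
eval(as.symbol(" "), d)
```

With the default check.names = TRUE, a lookup of " " would not find the column in the data.frame, because it is stored under the altered name.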