In my work I have always written user-defined functions in R like this:
f <- function(x) {
  x ^ 2
}
f(10)
# [1] 100
I recently came across an alternative method to call a function in R:
(function(x) x ^ 2)(10)
# [1] 100
I wasn't sure what was going on, so after some searching I found a wonderful answer provided by Allan Cameron that a non-programmer like me can understand:
The R parser recognizes this as meaning "call that function with these arguments".
This clarifies my understanding of what's going on, but not why or if I should choose one syntax over the other (aside from personal preference).
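As far as I can tell, the parenthesized expression simply evaluates to an ordinary function object that can be assigned to a name or called on the spot. A minimal sketch of that equivalence (the name g is just illustrative):
g <- (function(x) x ^ 2)   # the outer parentheses just return the function object
g(10)
# [1] 100
identical(g(10), (function(x) x ^ 2)(10))
# [1] TRUE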
I use UDFs a lot for various simulations and models that generate a lot of data and sometimes run for a while, so I'm always looking to optimize. Aside from just being an alternative syntax, I wanted to see if there was a functional, programmatic, or machine-based reason to write one format over the other. After comparing some simple functions (below), it appears my "usual" way of writing (f <- function(x) {...}) is substantially faster across a few types of simplified functions, about twice as fast for these simple examples.
Aside from syntax/personal preference, is there a reason, use case, or relevant example where the "new-to-me" way of writing a function ((function(x)(x^2))(10)) would be superior to the "usual" way (f <- function(x) {...})?
In other words: why does this option exist/why would someone use this syntax?
I couldn't find anything after searching a few ways and reading this, this, this, and this - in fact, I found surprisingly little about the "alternative" syntax online at all.
Comparisons
# Function 1
f1 <- function(x) {
  x <- as.numeric(x)
  x[x < 10] <- x[x < 10] ^ 2 / pi
  x
}
# Function 2
f2 <- Vectorize(function(x) {
  paste0("num_", 1:x)
})
# Function 3
f3 <- Vectorize(function(x) {
  if (x < 10)
    x <- x + 1
  while (x < 10) {
    x <- x + 1
  }
  x
})
Compare
microbenchmark::microbenchmark(
  `f1` = f1(c("10", 20, "5")),
  `(function (x)(f1))` = (function(x) {
    x <- as.numeric(x)
    x[x < 10] <- x[x < 10] ^ 2 / pi
    x
  })(c("10", 20, "5")),
  `f2` = f2(c(5, 10)),
  `(function (x)(f2))` = Vectorize((function(x) paste0("num_", 1:x)))(c(5, 10)),
  `f3` = f3(1:15),
  `(function (x)(f3))` = Vectorize((function(x) {
    if (x < 10)
      x <- x + 1
    while (x < 10) {
      x <- x + 1
    }
    x
  }))(1:15),
  times = 1e4
)
Results
Unit: microseconds
expr min lq mean median uq max neval
f1 2.446 3.9700 5.220236 5.1720 6.0460 113.914 10000
(function (x)(f1)) 3.270 5.3725 6.741182 6.6260 7.7385 46.308 10000
f2 26.388 30.1885 34.328340 32.7105 37.1725 227.455 10000
(function (x)(f2)) 53.808 60.6005 71.588443 65.7905 75.2055 5997.770 10000
f3 30.294 34.5735 42.077120 37.5160 42.4010 6121.705 10000
(function (x)(f3)) 58.492 65.1845 78.551417 70.5040 80.6610 6945.062 10000
That's an anonymous function, as has been mentioned in the comments. Some points on this:
A "named function" is just an anonymous function assigned to an object. Just like pi
is an object, so is mean
(the function). (Credit: @G.Grothendieck) The variable symbol (pi
and mean
) is just a reference to where the object is stored in memory (loosely summarized), no more no less.
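As a minimal sketch of that point (sq is just an illustrative name), assignment is all that separates a "named" function from an anonymous one:
sq <- function(x) x ^ 2   # an anonymous function object, now bound to the symbol sq
is.function(mean)         # mean is likewise just a symbol bound to a function object
# [1] TRUE
sq2 <- sq                 # a second symbol referencing the same function object
identical(sq, sq2)
# [1] TRUE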
Have you ever seen sapply or similar used in the following way?
sapply(mtcars, function(z) sum(z %in% c(4,6,21)))
In that use, the anon-func could easily be stored in a variable and used there:
func <- function(z) sum(z %in% c(4,6,21))
sapply(mtcars, func)
# mpg cyl disp hp drat wt qsec vs am gear carb
# 2 18 0 0 0 0 0 0 0 12 11
And we can use that named-func and anon-func in identical ways:
func(mtcars$cyl)
# [1] 18
(function(z) sum(z %in% c(4,6,21)))(mtcars$cyl)
# [1] 18
New to R (version 4.1.0 and later) is a slightly smaller (code-golf) way to define functions:
(\(z) sum(z %in% c(4,6,21)))(mtcars$cyl)
# [1] 18
The two methods (function(x) ... and \(x) ...) are identical; there is no advantage to either one (other than my muscle-memory having a harder time typing \(x) quickly).
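As a quick check of that equivalence (assuming R 4.1.0 or later, and that both expressions are created in the same environment):
identical(function(z) sum(z %in% c(4,6,21)), \(z) sum(z %in% c(4,6,21)))
# [1] TRUE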
Your benchmark makes sense to me: when you use an anonymous function, included in the time-to-execute is the time to parse the function. Each time the benchmark executes an expression with an anon-func, it parses the function body, stores it (behind the scenes) in a new memory location, and then executes it. There is no efficient mechanism here to detect that the expression it is parsing is identical to a previous one, so each time a specific expression is parsed, it is parsed as "new" and stored as "unique", despite the fact that its parsing produces identical results to the parsing of the last n replications in the benchmark.
When you use the named functions (f1(..)), the parsing of f1 happens once, outside of the benchmarking, so the time-to-parse never enters into the timing.
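A stripped-down sketch of that contrast (assuming microbenchmark is available): f below is defined before the benchmark, so only the anonymous version pays the cost of constructing the function object on every evaluation.
f <- function(x) x ^ 2
microbenchmark::microbenchmark(
  named = f(10),                     # function object already exists
  anon  = (function(x) x ^ 2)(10),   # function object rebuilt on every evaluation
  times = 1e4
)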
When an anon-func is used within sapply, it is parsed once and then reused for each execution of that call to sapply. For instance, if I call
sapply(mtcars, function(z) sum(z %in% c(4,6,21)))
sapply(mtcars, function(z) sum(z %in% c(4,6,21)))
while we can see that the anon-func is identical in both cases, it ends up being parsed twice, once for each call to sapply. In cases like this, there is a micro-savings in defining the function as a variable and passing it to sapply as a named function. To say "micro" in this case is an exaggeration ... with everything else going on, I don't think we can easily measure a significant difference in execution time, but ...
bench::mark(
  anon = sapply(mtcars, function(z) sum(z %in% c(4,6,21))),
  named = sapply(mtcars, func),
  iterations = 100000
)
# # A tibble: 2 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 anon 27µs 32µs 29343. 12.92KB 9.39 99968 32 3.41s <int [11]> <Rprofmem> <bench_tm> <tibble>
# 2 named 27.1µs 32.6µs 28943. 3.27KB 8.98 99969 31 3.45s <int [11]> <Rprofmem> <bench_tm> <tibble>
The fact that the anon= expression measures slightly faster here is both a little perplexing and perhaps confirmation that the time-savings in this case are so small as to be dwarfed by other factors.