Defining functions in R using alternative syntax approaches


In my work I have always written user-defined functions in R like this:

f <- function(x){
  x ^ 2
}

f(10)
# [1] 100

I recently came across an alternative method to call a function in R:

(function(x) x ^ 2)(10)

# [1] 100

I wasn't sure what was going on, so after some searching I found a wonderful answer by Allan Cameron that a non-programmer like me can understand:

The R parser recognizes this as meaning "call that function with these arguments".

This clarifies my understanding of what's going on, but not why or if I should choose one syntax over the other (aside from personal preference).
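
To make the quoted sentence concrete, here is a minimal sketch using only base R. The name `square` is hypothetical and chosen for illustration. It shows that a `function(x) ...` expression evaluates to an ordinary function object, which can either be bound to a name first or called immediately; both routes return the same value.

```r
# A function expression evaluates to an ordinary R function object.
is_fn <- is.function(function(x) x ^ 2)

# Route 1: bind the object to a name, then call it by name.
# (`square` is a hypothetical name used only for this illustration.)
square <- function(x) x ^ 2
named_result <- square(10)

# Route 2: wrap the expression in parentheses and call it immediately.
anon_result <- (function(x) x ^ 2)(10)

# Both routes run the same code and return the same value.
same <- identical(named_result, anon_result)
```

The parentheses around the definition are what let the parser treat the whole definition as the thing being called.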

I use UDFs a lot for simulations and models that generate a lot of data and sometimes run for a while, so I am always looking to optimize. Beyond being an alternative syntax, I wanted to see whether there was a programmatic or performance reason to prefer one form over the other. After comparing some simple functions (below), it appears that my "usual" way of writing (f <- function(x)(...)) is consistently faster: by median time, roughly 1.3 times as fast for f1 and about twice as fast for f2 and f3.

Aside from syntax/personal preference, is there a reason, use case, or relevant example where the "new-to-me" way of writing a function ((function(x)(x^2))(10)) would be superior to the "usual" way (f <- function(x)(...))?

In other words: why does this option exist/why would someone use this syntax?

I couldn't find anything after searching a few ways and reading this, this, this, and this - in fact, I found surprisingly little about the "alternative" syntax online at all.


Comparisons

# Function 1
f1 <- function(x) {
  x <- as.numeric(x)
  x[x < 10] <- x[x < 10] ^ 2 / pi
  x
}

# Function 2
f2 <- Vectorize(function(x) {
  paste0("num_", 1:x)
  })

# Function 3
f3 <- Vectorize(function(x) {
  if (x < 10)
    x < x + 1
  while (x < 10) {
    x <- x + 1
  }
  x
})

Compare

microbenchmark::microbenchmark(
  `f1` =  f1(c("10", 20, "5")),
  `(function (x)(f1))` = (function(x) {
                            x <- as.numeric(x)
                            x[x < 10] <- x[x < 10] ^ 2 / pi
                            x
                          })(c("10", 20, "5")),
  `f2` =  f2(c(5,10)),
  `(function (x)(f2))` = Vectorize((function(x) paste0("num_", 1:x)))(c(5,10)),
  `f3` = f3(1:15),
  `(function (x)(f3))` = Vectorize((function(x) {
                            if (x < 10)
                              x < x + 1
                            while (x < 10) {
                              x <- x + 1
                            }
                            x
                          }))(1:15),
  times = 1e4
)

Results

Unit: microseconds
               expr    min      lq      mean  median      uq      max neval
                 f1  2.446  3.9700  5.220236  5.1720  6.0460  113.914 10000
 (function (x)(f1))  3.270  5.3725  6.741182  6.6260  7.7385   46.308 10000
                 f2 26.388 30.1885 34.328340 32.7105 37.1725  227.455 10000
 (function (x)(f2)) 53.808 60.6005 71.588443 65.7905 75.2055 5997.770 10000
                 f3 30.294 34.5735 42.077120 37.5160 42.4010 6121.705 10000
 (function (x)(f3)) 58.492 65.1845 78.551417 70.5040 80.6610 6945.062 10000

Solution

  • That's an anonymous function, as has been mentioned in the comments. Some points on this:

    • A "named function" is just an anonymous function bound to a name. Just as pi is an object, so is mean (the function). (Credit: @G.Grothendieck) The symbol (pi or mean) is just a reference to where the object is stored in memory (loosely summarized), no more, no less.

    • Have you ever seen sapply or similar used in the following way?

      sapply(mtcars, function(z) sum(z %in% c(4,6,21)))
      

      In that use, the anon-func could easily be stored in a variable and used there:

      func <- function(z) sum(z %in% c(4,6,21))
      sapply(mtcars, func)
      #  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
      #    2   18    0    0    0    0    0    0    0   12   11 
      

      And we can use that named-func and anon-func in identical ways:

      func(mtcars$cyl)
      # [1] 18
      (function(z) sum(z %in% c(4,6,21)))(mtcars$cyl)
      # [1] 18
      
    • New in R 4.1.0 is a slightly shorter (code-golf) way to define functions:

      (\(z) sum(z %in% c(4,6,21)))(mtcars$cyl)
      # [1] 18
      

      The two forms (function(x) ... and \(x) ...) are identical; neither has an advantage over the other (other than my muscle memory having a harder time typing \(x) quickly).

    • Your benchmark makes sense to me: when you use an anonymous function, the time-to-execute includes the time to create the function. R parses the source text of the expression only once, but each time the benchmark evaluates an expression containing an anon-func, it evaluates the function(x) ... definition again, stores the resulting function object (behind the scenes) in a new memory location, and then calls it. There is no mechanism here to detect that the definition is identical to a previous one, so each replication builds a "new", "unique" function even though the result is the same as in the previous n replications. In your f2 and f3 cases, the anonymous variants also call Vectorize() on every replication, whereas the named versions call it once, when f2 and f3 are defined.

      When you use the named functions (f1(..)), the function is created once, outside of the benchmark, so the time to create it never enters into the timing.

      When an anon-func is used within sapply, it is created once and then reused for every element within that call to sapply. For instance, if I call

      sapply(mtcars, function(z) sum(z %in% c(4,6,21)))
      sapply(mtcars, function(z) sum(z %in% c(4,6,21)))
      

      while we can see that the anon-func is identical in both cases, it is created twice, once for each call to sapply. In cases like this, there is a micro-savings in storing the function in a variable and passing it to sapply by name. Even "micro" is an exaggeration here: with everything else going on, I don't think we can easily measure a significant difference in execution time, but ...

      bench::mark(
        anon  = sapply(mtcars, function(z) sum(z %in% c(4,6,21))), 
        named = sapply(mtcars, func),
        iterations = 100000
      )
      # # A tibble: 2 × 13
      #   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result     memory     time       gc      
      #   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>     <list>     <list>     <list>  
      # 1 anon           27µs     32µs    29343.   12.92KB     9.39 99968    32      3.41s <int [11]> <Rprofmem> <bench_tm> <tibble>
      # 2 named        27.1µs   32.6µs    28943.    3.27KB     8.98 99969    31      3.45s <int [11]> <Rprofmem> <bench_tm> <tibble>
      

      The fact that anon= measures slightly faster here is both a little perplexing and perhaps confirmation that the time savings in this case are so small as to be dwarfed by other factors.
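
As a rough illustration of the point about repeated work, the sketch below contrasts building a vectorized function once with rebuilding it on every pass of a loop. It uses only base R and the built-in mtcars data. The names `make_fn`, `vec_once`, `hoisted`, and `inline` are hypothetical and not part of the original question or answer. It makes no timing claims; it only shows that the two styles compute the same results, so hoisting the definition out of a hot loop is safe when the function does not depend on loop state.

```r
# Each evaluation of a `function(...)` expression builds a new function object.
# `make_fn` is a hypothetical helper that returns a fresh one on every call.
make_fn <- function() function(z) sum(z %in% c(4, 6, 21))
a <- make_fn()
b <- make_fn()

# Two separately created functions still behave identically on the same input.
same_behavior <- identical(a(mtcars$cyl), b(mtcars$cyl))

# Hoisted style: call Vectorize() once, outside the loop, and reuse the result.
vec_once <- Vectorize(function(x) paste0("num_", 1:x))

hoisted <- vector("list", 3)
inline  <- vector("list", 3)
for (i in 1:3) {
  hoisted[[i]] <- vec_once(c(5, 10))
  # Inline style: the function and its Vectorize() wrapper are rebuilt each pass.
  inline[[i]] <- Vectorize(function(x) paste0("num_", 1:x))(c(5, 10))
}

# Same results either way; only the amount of repeated setup work differs.
same_output <- identical(hoisted, inline)
```

In practice this only matters when the call sits inside a loop that runs many times, as it does inside a benchmark harness.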