Search code examples
rperformancelapply

What are the performance differences between for-loops and the apply family of functions?


It is often said that one should prefer lapply over for loops. There are some exception as for example Hadley Wickham points out in his Advance R book.

(http://adv-r.had.co.nz/Functionals.html) (Modifying in place, Recursion etc). The following is one of this case.

Just for sake of learning, I tried to rewrite a perceptron algorithm in a functional form in order to benchmark relative performance. source (https://rpubs.com/FaiHas/197581).

Here is the code.

# prepare input
data(iris)
irissubdf <- iris[1:100, c(1, 3, 5)]
names(irissubdf) <- c("sepal", "petal", "species")
head(irissubdf)
irissubdf$y <- 1
irissubdf[irissubdf[, 3] == "setosa", 4] <- -1
x <- irissubdf[, c(1, 2)]
y <- irissubdf[, 4]

# perceptron function with for
perceptron <- function(x, y, eta, niter) {

  # initialize weight vector
  weight <- rep(0, dim(x)[2] + 1)
  errors <- rep(0, niter)


  # loop over number of epochs niter
  for (jj in 1:niter) {

    # loop through training data set
    for (ii in 1:length(y)) {

      # Predict binary label using Heaviside activation
      # function
      z <- sum(weight[2:length(weight)] * as.numeric(x[ii, 
        ])) + weight[1]
      if (z < 0) {
        ypred <- -1
      } else {
        ypred <- 1
      }

      # Change weight - the formula doesn't do anything
      # if the predicted value is correct
      weightdiff <- eta * (y[ii] - ypred) * c(1, 
        as.numeric(x[ii, ]))
      weight <- weight + weightdiff

      # Update error function
      if ((y[ii] - ypred) != 0) {
        errors[jj] <- errors[jj] + 1
      }

    }
  }

  # weight to decide between the two species

  return(errors)
}

err <- perceptron(x, y, 1, 10)

### my rewriting in functional form auxiliary
### function
faux <- function(x, weight, y, eta) {
  err <- 0
  z <- sum(weight[2:length(weight)] * as.numeric(x)) + 
    weight[1]
  if (z < 0) {
    ypred <- -1
  } else {
    ypred <- 1
  }

  # Change weight - the formula doesn't do anything
  # if the predicted value is correct
  weightdiff <- eta * (y - ypred) * c(1, as.numeric(x))
  weight <<- weight + weightdiff

  # Update error function
  if ((y - ypred) != 0) {
    err <- 1
  }
  err
}

weight <- rep(0, 3)
weightdiff <- rep(0, 3)

f <- function() {
  t <- replicate(10, sum(unlist(lapply(seq_along(irissubdf$y), 
    function(i) {
      faux(irissubdf[i, 1:2], weight, irissubdf$y[i], 
        1)
    }))))
  weight <<- rep(0, 3)
  t
}

I did not expected any consistent improvement due to the aforementioned issues. But nevertheless I was really surprised when I saw the sharp worsening using lapply and replicate.

I obtained this results using microbenchmark function from microbenchmark library

What could possibly be the reasons? Could it be some memory leak?

                                                      expr       min         lq       mean     median         uq
                                                        f() 48670.878 50600.7200 52767.6871 51746.2530 53541.2440
  perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)  4184.131  4437.2990  4686.7506  4532.6655  4751.4795
 perceptronC(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)    95.793   104.2045   123.7735   116.6065   140.5545
        max neval
 109715.673   100
   6513.684   100
    264.858   100

The first function is the lapply/replicate function

The second is the function with for loops

The third is the same function in C++ using Rcpp

Here According to Roland the profiling of the function. I am not sure I can interpret it in the right way. It looks like to me most of the time is spent in subsetting Function profiling


Solution

  • First of all, it is an already long debunked myth that for loops are any slower than lapply. The for loops in R have been made a lot more performant and are currently at least as fast as lapply.

    That said, you have to rethink your use of lapply here. Your implementation demands assigning to the global environment, because your code requires you to update the weight during the loop. And that is a valid reason to not consider lapply.

    lapply is a function you should use for its side effects (or lack of side effects). The function lapply combines the results in a list automatically and doesn't mess with the environment you work in, contrary to a for loop. The same goes for replicate. See also this question:

    Is R's apply family more than syntactic sugar?

    The reason your lapply solution is far slower, is because your way of using it creates a lot more overhead.

    • replicate is nothing else but sapply internally, so you actually combine sapply and lapply to implement your double loop. sapply creates extra overhead because it has to test whether or not the result can be simplified. So a for loop will be actually faster than using replicate.
    • inside your lapply anonymous function, you have to access the dataframe for both x and y for every observation. This means that -contrary to in your for-loop- eg the function $ has to be called every time.
    • Because you use these high-end functions, your 'lapply' solution calls 49 functions, compared to your for solution that only calls 26. These extra functions for the lapply solution include calls to functions like match, structure, [[, names, %in%, sys.call, duplicated, ... All functions not needed by your for loop as that one doesn't do any of these checks.

    If you want to see where this extra overhead comes from, look at the internal code of replicate, unlist, sapply and simplify2array.

    You can use the following code to get a better idea of where you lose your performance with the lapply. Run this line by line!

    Rprof(interval = 0.0001)
    f()
    Rprof(NULL)
    fprof <- summaryRprof()$by.self
    
    Rprof(interval = 0.0001)
    perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10) 
    Rprof(NULL)
    perprof <- summaryRprof()$by.self
    
    fprof$Fun <- rownames(fprof)
    perprof$Fun <- rownames(perprof)
    
    Selftime <- merge(fprof, perprof,
                      all = TRUE,
                      by = 'Fun',
                      suffixes = c(".lapply",".for"))
    
    sum(!is.na(Selftime$self.time.lapply))
    sum(!is.na(Selftime$self.time.for))
    Selftime[order(Selftime$self.time.lapply, decreasing = TRUE),
             c("Fun","self.time.lapply","self.time.for")]
    
    Selftime[is.na(Selftime$self.time.for),]