Search code examples
rdiffbenchmarking

Why is R diff function slow?


When working with vectors in R, the diff function computes the differences between each value and the previous one. From ?diff:

If x is a vector of length n and differences = 1, then the computed result is equal to the successive differences x[(1+lag):n] - x[1:(n-lag)]

However, when I tested the execution time of diff function vs their theoretical expression (using microbenchmark function from microbenchmark package), the diff function is slower. Here is my code:

library(microbenchmark)

mb.diff1 <- function(n, seed){
  set.seed(seed)
  vec <- runif(n)
  out <- diff(vec)
  return(out)
}

mb.diff2 <- function(n, seed){
  set.seed(seed)
  vec <- runif(n)
  out <- vec[2:n]-vec[1:(n-1)]
  return(out)
}

times.diff1 <- c()  
times.diff2 <- c()
vec.sizes <- c(1e1, 1e2, 1e3, 1e4)
for (n in vec.sizes){
  bench <- microbenchmark(
    mb.diff1(n,1),
    mb.diff2(n,1))
  times.median <- aggregate(
    bench$time,
    by  = list(bench$expr), 
    FUN = median)
  times.diff1 <- c(times.diff1, times.median[1,2])
  times.diff2 <- c(times.diff2, times.median[2,2])
}

perf.ratio <- times.diff1/times.diff2
names(perf.ratio) <- vec.sizes
print(perf.ratio)

I finished with vec.sizes of 1e4, so the excution time for you guys does not take too long but I make them go until 1e7. You can see the results here:

enter image description here

As you can see, the diff function is slower for all vector sizes. The quotient tends to decrease because it seems in both cases the execution time is a linear function of the vector size so we can't say that diff performs better as n increases. So here comes the questions:

  1. (The obvious question) Am I doing something wrong in my code when measuring the execution times?
  2. What could be the cause of the diff function being slower than their theoretical expression?
  3. Do you know a most efficient way to compute the differences vector than x[(1+lag):n] - x[1:(n-lag)]?

I'm using R 3.1.2 in Linux.

Thank you very much in advance.


Solution

  • Here is some code to help expand and illustrate the points I made in my comment.

    library(microbenchmark)
    
    mb.diff2 <- compiler::cmpfun(function(vec) {
      n <- length(vec)
      vec[2:n]-vec[1:(n-1L)]
    })
    
    times.diff1 <- c()  
    times.diff2 <- c()
    times.diff3 <- c()
    vec.sizes <- c(1e1, 1e2, 1e3, 1e4, 1e5)
    
    for (n in vec.sizes) {
      set.seed(21)
      vec <- runif(n)
      bench <- microbenchmark(diff(vec), mb.diff2(vec), diff.default(vec))
      times.median <- aggregate(bench$time, by = list(bench$expr), FUN = median)
      times.diff1 <- c(times.diff1, times.median[1,2])
      times.diff2 <- c(times.diff2, times.median[2,2])
      times.diff3 <- c(times.diff3, times.median[3,2])
    }
    
    setNames(times.diff1/times.diff2, vec.sizes)
    setNames(times.diff1/times.diff3, vec.sizes)
    

    First, you'll notice that I compiled the mb.diff2 function. This is because diff and diff.default are byte-compiled. I also put the calculation of n inside mb.diff2, since calculating the vector length should be part of the measured function call.

    Here are the results of the timings, along with my sessionInfo():

    R> setNames(times.diff1/times.diff2, vec.sizes)
           10       100      1000     10000     1e+05 
    3.5781536 2.3330988 1.2488135 0.9011312 0.9660411 
    R> setNames(times.diff1/times.diff3, vec.sizes)
           10       100      1000     10000     1e+05 
    1.5945010 1.4609283 1.1021190 1.0034623 0.9987618 
    R> sessionInfo()
    R version 3.3.2 (2016-10-31)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 16.04.1 LTS
    
    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
     [9] LC_ADDRESS=C               LC_TELEPHONE=C            
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
    [1] microbenchmark_1.4-2
    
    loaded via a namespace (and not attached):
     [1] Rcpp_0.12.9      digest_0.6.8     MASS_7.3-45      grid_3.3.2      
     [5] plyr_1.8.4       gtable_0.1.2     magrittr_1.5     scales_0.3.0    
     [9] ggplot2_1.0.1    stringi_0.4-1    reshape2_1.4.1   proto_0.3-10    
    [13] tools_3.3.2      stringr_1.0.0    munsell_0.4.2    compiler_3.3.2  
    [17] colorspace_1.2-6