When working with vectors in R, the diff
function computes the differences between each value and the previous one. From ?diff
:
If
x
is a vector of lengthn
anddifferences = 1
, then the computed result is equal to the successive differencesx[(1+lag):n] - x[1:(n-lag)]
However, when I tested the execution time of diff
function vs their theoretical expression (using microbenchmark
function from microbenchmark
package), the diff
function is slower. Here is my code:
library(microbenchmark)
mb.diff1 <- function(n, seed){
set.seed(seed)
vec <- runif(n)
out <- diff(vec)
return(out)
}
mb.diff2 <- function(n, seed){
set.seed(seed)
vec <- runif(n)
out <- vec[2:n]-vec[1:(n-1)]
return(out)
}
times.diff1 <- c()
times.diff2 <- c()
vec.sizes <- c(1e1, 1e2, 1e3, 1e4)
for (n in vec.sizes){
bench <- microbenchmark(
mb.diff1(n,1),
mb.diff2(n,1))
times.median <- aggregate(
bench$time,
by = list(bench$expr),
FUN = median)
times.diff1 <- c(times.diff1, times.median[1,2])
times.diff2 <- c(times.diff2, times.median[2,2])
}
perf.ratio <- times.diff1/times.diff2
names(perf.ratio) <- vec.sizes
print(perf.ratio)
I finished with vec.sizes of 1e4, so the excution time for you guys does not take too long but I make them go until 1e7. You can see the results here:
As you can see, the diff
function is slower for all vector sizes. The quotient tends to decrease because it seems in both cases the execution time is a linear function of the vector size so we can't say that diff
performs better as n increases. So here comes the questions:
diff
function being slower than their theoretical expression?x[(1+lag):n] - x[1:(n-lag)]
?I'm using R 3.1.2 in Linux.
Thank you very much in advance.
Here is some code to help expand and illustrate the points I made in my comment.
library(microbenchmark)
mb.diff2 <- compiler::cmpfun(function(vec) {
n <- length(vec)
vec[2:n]-vec[1:(n-1L)]
})
times.diff1 <- c()
times.diff2 <- c()
times.diff3 <- c()
vec.sizes <- c(1e1, 1e2, 1e3, 1e4, 1e5)
for (n in vec.sizes) {
set.seed(21)
vec <- runif(n)
bench <- microbenchmark(diff(vec), mb.diff2(vec), diff.default(vec))
times.median <- aggregate(bench$time, by = list(bench$expr), FUN = median)
times.diff1 <- c(times.diff1, times.median[1,2])
times.diff2 <- c(times.diff2, times.median[2,2])
times.diff3 <- c(times.diff3, times.median[3,2])
}
setNames(times.diff1/times.diff2, vec.sizes)
setNames(times.diff1/times.diff3, vec.sizes)
First, you'll notice that I compiled the mb.diff2
function. This is because diff
and diff.default
are byte-compiled. I also put the calculation of n
inside mb.diff2
, since calculating the vector length should be part of the measured function call.
Here are the results of the timings, along with my sessionInfo()
:
R> setNames(times.diff1/times.diff2, vec.sizes)
10 100 1000 10000 1e+05
3.5781536 2.3330988 1.2488135 0.9011312 0.9660411
R> setNames(times.diff1/times.diff3, vec.sizes)
10 100 1000 10000 1e+05
1.5945010 1.4609283 1.1021190 1.0034623 0.9987618
R> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4-2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 digest_0.6.8 MASS_7.3-45 grid_3.3.2
[5] plyr_1.8.4 gtable_0.1.2 magrittr_1.5 scales_0.3.0
[9] ggplot2_1.0.1 stringi_0.4-1 reshape2_1.4.1 proto_0.3-10
[13] tools_3.3.2 stringr_1.0.0 munsell_0.4.2 compiler_3.3.2
[17] colorspace_1.2-6