I have a function in a package (mainly for my own use currently; I might share it at some future point). I'm trying to replace a slow for loop with an lapply so that I can later parallelise it. One option I found that is hugely faster even without parallelisation is to use the global assignment operator `<<-`. However, I'm anxious about this because it seems to be frowned upon, and since I'm not used to thinking about environments, I worry about side effects.
Here is a simple reprex:
n <- 2
nx <- 40
v <- 5
d <- 3
array4d <- array(rep(0, n * nx * v * d),
                 dim = c(n, nx, v, d))
array4d2 <- array4d
# Make some data to enter into the array - in real problem a function gens this data depending on input vars
set.seed(4)
dummy_output <- lapply(1:v, function(i) runif(n*nx*d))
microbenchmark::microbenchmark({
  for (i in 1:v) {
    array4d[, , i, ] <- dummy_output[[i]]
  }
}, {
  lapply(1:v, function(i) {
    array4d2[, , i, ] <<- dummy_output[[i]]
  })
})
Unit: microseconds
                                                                     expr      min        lq       mean    median       uq      max neval cld
             { for (i in 1:v) { array4d[, , i, ] <- dummy_output[[i]] } } 1183.504 1273.6205 1488.26909 1411.4565 1515.762 3535.974   100   b
 { lapply(1:v, function(i) { array4d2[, , i, ] <<- dummy_output[[i]] }) }   13.257   16.1715   33.56976   18.1445   21.150 1525.608   100   a
> identical(array4d, array4d2)
[1] TRUE
All of this would be happening inside a function called many times by its parent.
So this is (lots!) faster. But my questions are about `<<-`: is it safe to use here, and what side effects should I watch out for?
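One point worth checking regarding the side-effect worry: `<<-` does not necessarily assign in the global environment. It walks up the *enclosing* (lexical) environments and modifies the first existing binding it finds. A minimal sketch (the function name `fill_array` is hypothetical) showing that, when the whole pattern lives inside a function, `<<-` from the lapply callback hits that function's local variable, not the global workspace:

```r
fill_array <- function(dummy_output, n, nx, v, d) {
  arr <- array(0, dim = c(n, nx, v, d))   # local to fill_array()
  lapply(seq_len(v), function(i) {
    # `<<-` searches enclosing environments and finds `arr`
    # in fill_array()'s frame, so it modifies the local copy
    arr[, , i, ] <<- dummy_output[[i]]
  })
  arr                                     # global workspace untouched
}

set.seed(4)
dummy_output <- lapply(1:5, function(i) runif(2 * 40 * 3))
res <- fill_array(dummy_output, 2, 40, 5, 3)
```

The dangerous case is when no enclosing binding exists: then `<<-` falls through to the global environment and creates or overwrites a variable there.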
Make the varying dimension the last one. microbenchmark indicates that its performance is not statistically different from the version using `<<-`. If it is important that the varying dimension be the third one, apply aperm(x, c(1, 2, 4, 3)) afterwards.
microbenchmark::microbenchmark(
  a = for (i in 1:v) array4d[, , i, ] <- dummy_output[[i]],
  b = lapply(1:v, function(i) array4d2[, , i, ] <<- dummy_output[[i]]),
  c = array4d3 <- array(unlist(dummy_output), dim = c(n, nx, d, v))
)