I have a very large dataset with dimensions of 60K x 4K. I am trying to add every four values in succession in every row, column-wise. The following is a smaller example dataset.
set.seed(123)
mat <- matrix(sample(0:1, 48, replace = TRUE), 4)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,]    0    1    1    1    0    1    1    0    1     1     0     0
[2,]    1    0    0    1    0    1    1    0    1     0     0     0
[3,]    0    1    1    0    0    1    1    1    0     0     0     0
[4,]    1    1    0    1    1    1    1    1    0     0     0     0
Here is what I am trying to perform:
mat[1,1] + mat[1,2] + mat[1,3] + mat[1,4] = 0 + 1 + 1 + 1 = 3
i.e., add every four values and output the sum.
mat[1,5] + mat[1,6] + mat[1,7] + mat[1,8] = 0 + 1 + 1 + 0 = 2
Keep going to the end of the matrix (here, to column 12).
mat[1,9] + mat[1,10] + mat[1,11] + mat[1,12]
Once the first row is done, apply the same to the second row:
mat[2,1] + mat[2,2] + mat[2,3] + mat[2,4]
mat[2,5] + mat[2,6] + mat[2,7] + mat[2,8]
mat[2,9] + mat[2,10] + mat[2,11] + mat[2,12]
The result will be an nrow x (ncol/4) matrix.
The expected result will look like:
      col1-4  col5-8  col9-12
row1       3       2        2
row2       2       2        1
row3       2       3        0
row4       3       4        0
Similarly for row 3 through the last row of the matrix. How can I loop over this efficiently?
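For reference, a naive double loop over rows and groups of four columns (just a sketch of the direct approach; funLoop is a made-up name) would look like this, but it is likely far too slow at 60K x 4K:
funLoop <- function(mat) {
  out <- matrix(0, nrow(mat), ncol(mat) / 4)  # result is nrow x (ncol/4)
  for (i in seq_len(nrow(mat))) {
    for (j in seq_len(ncol(mat) / 4)) {
      out[i, j] <- sum(mat[i, (4 * j - 3):(4 * j)])  # sum columns 4j-3 .. 4j of row i
    }
  }
  out
}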
While Matthew's answer is really cool (+1, btw), you can get a much (~100x) faster solution if you avoid apply and use the *Sums functions (in this case colSums), and a bit of vector manipulation trickery:
funSums <- function(mat) {
  t.mat <- t(mat)                        # rows become columns
  dim(t.mat) <- c(4, length(t.mat) / 4)  # wrap the columns every four items (these are the groups we want to sum)
  t(matrix(colSums(t.mat), nrow = ncol(mat) / 4))  # sum the new 4-element columns, then reconstruct the desired output shape
}
set.seed(123)
mat <- matrix(sample(0:1, 48, replace = TRUE), 4)
funSums(mat)
Produces desired output:
[,1] [,2] [,3]
[1,] 3 2 2
[2,] 2 2 1
[3,] 2 3 0
[4,] 3 4 0
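To see what the reshaping trick is doing, here is the intermediate state on the small example (an illustrative trace of funSums's internals, with shapes noted in the comments):
t.mat <- t(mat)                        # 12 x 4: row i of mat becomes column i
dim(t.mat) <- c(4, length(t.mat) / 4)  # 4 x 12: each column now holds one group of four consecutive values
colSums(t.mat)                         # 12 group sums, row 1's groups first, then row 2's, and so on
# [1] 3 2 2 2 2 1 2 3 0 3 4 0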
Now, let's make something of a realistic size and compare against the other options:
set.seed(123)
mat <- matrix(sample(0:1, 6e5, replace = TRUE), 4)
funApply <- function(mat) {  # Matthew's solution
  apply(array(mat, dim = c(4, 4, ncol(mat) / 4)), MARGIN = c(1, 3), FUN = sum)
}
library(RcppRoll)  # roll_sum() comes from the RcppRoll package
funRcpp <- function(mat) {  # David's solution
  roll_sum(mat, 4, by.column = F)[, seq_len(ncol(mat) - 4 + 1) %% 4 == 1]
}
library(microbenchmark)
microbenchmark(times = 10,
  funSums(mat),
  funApply(mat),
  funRcpp(mat)
)
Produces:
Unit: milliseconds
          expr        min         lq     median       uq       max neval
  funSums(mat)   4.035823   4.079707   5.256517   7.5359  42.06529    10
 funApply(mat) 379.124825 399.060015 430.899162 455.7755 471.35960    10
  funRcpp(mat)  18.481184  20.364885  38.595383 106.0277 132.93382    10
And to check:
all.equal(funSums(mat), funApply(mat))
# [1] TRUE
all.equal(funSums(mat), funRcpp(mat))
# [1] TRUE
The key point is that the *Sums functions are fully "vectorized", inasmuch as all the calculations happen in C. apply still has to do a fair amount of work in R that isn't vectorized in that primitive-C-function sense, so it is slower (but far more flexible). Specific to this problem, it might be possible to make funSums another 2-3x faster, since about half its time is spent on the transpositions, which are only necessary so that the dim changes do what I need for colSums to work.
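For example, one transpose-free variant (my own sketch, not part of the benchmark above; funSlice is a made-up name, and it assumes ncol(mat) is a multiple of 4) just adds the four strided column slices directly:
funSlice <- function(mat) {
  idx <- seq(1, ncol(mat), by = 4)  # first column of each group of four
  mat[, idx] + mat[, idx + 1] + mat[, idx + 2] + mat[, idx + 3]  # elementwise-add the four slices
}
all.equal(funSlice(mat), funSums(mat))  # sanity check: should be TRUE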