I have a 2x4 matrix `A` and a 3x4 matrix `B`. I would like to get an array `C` of dimensions (2, 3, 4) where the `ijk` entry of `C` is the product of the `ik` entry of `A` and the `jk` entry of `B`.
Is there a fast way to do this in R that avoids a loop? Example below with two ways to calculate what I am looking for, both of which involve `for` loops.
A <- matrix(1:8, 2, 4)
B <- matrix(11:22, 3, 4)

C1 <- array(NA, dim = c(2, 3, 4))
for (ii in 1:2) {
  for (jj in 1:3) {
    C1[ii, jj, ] <- A[ii, ] * B[jj, ]
  }
}

C2 <- array(NA, dim = c(2, 3, 4))
for (ss in 1:4) {
  C2[, , ss] <- outer(A[, ss], B[, ss])
}
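For completeness, the construction can also be done with no loop or `*apply` call at all, by recycling rows with `rep` and reshaping. This is a sketch of my own (the name `Cvec` is not from the code above); column-major filling is what makes the index bookkeeping work out:

```r
A <- matrix(1:8, 2, 4)
B <- matrix(11:22, 3, 4)

# Repeat A's rows nrow(B) times (interleaved) and B's rows nrow(A) times
# (blocked), multiply elementwise, then fold the 6x4 product into a 2x3x4
# array. Column-major filling makes entry (i, j, k) equal A[i, k] * B[j, k].
Cvec <- array(
  A[rep(seq_len(nrow(A)), times = nrow(B)), ] *
    B[rep(seq_len(nrow(B)), each = nrow(A)), ],
  dim = c(nrow(A), nrow(B), ncol(A))
)
dim(Cvec)
# [1] 2 3 4
```

This avoids both the explicit loops and the per-slice `outer` calls; whether it beats the other variants will depend on the matrix sizes, and I have not benchmarked it here.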
Not much of an improvement, but using `abind`:
C3 <- do.call(abind::abind, c(lapply(seq(ncol(A)), function(ss) outer(A[,ss], B[,ss])), along=3))
NOTE: the previous time I ran this benchmark was on a different laptop running R-4.0.5, and in that case, running the benchmark several times, `C1`'s performance was on par with `C2`'s. I switched laptops (for other reasons), saw @jay.sf's answer and wanted to add it to the fray, and now `C1` is significantly faster. I cannot explain the difference, but the relevant specs:
It seems to me that the "turbulence" in the benchmarks is due to the small data size: the timings are dominated by per-call administrative overhead rather than the arithmetic itself. If the matrices were much larger, I would expect the methods to break out more clearly (not yet verified).
bench::mark(
  C1 = {
    for (ii in 1:2) {
      for (jj in 1:3) {
        C1[ii, jj, ] <- A[ii, ] * B[jj, ]
      }
    }
    as.numeric(C1)
  },
  C2 = {
    for (ss in 1:4) {
      C2[, , ss] <- outer(A[, ss], B[, ss])
    }
    as.numeric(C2)
  },
  C3 = as.numeric(do.call(abind::abind, c(lapply(seq(ncol(A)), function(ss) outer(A[, ss], B[, ss])), along = 3))),
  C4 = as.numeric(vapply(1:4, \(ss) outer(A[, ss], B[, ss]), matrix(0, 2, 3)))
)
# # A tibble: 4 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 C1 16.6us 52.3us 18669. 240B 0 9317 0 499ms <dbl [24]> <Rprofmem [1 x 3]> <bench_tm [9,317]> <tibble [9,317 x 3]>
# 2 C2 25.8us 85.6us 11225. 240B 2.14 5235 1 466ms <dbl [24]> <Rprofmem [1 x 3]> <bench_tm [5,236]> <tibble [5,236 x 3]>
# 3 C3 763us 797.4us 1144. 720B 0 572 0 500ms <dbl [24]> <Rprofmem [3 x 3]> <bench_tm [572]> <tibble [572 x 3]>
# 4 C4 24.8us 36.7us 21752. 240B 2.18 9999 1 460ms <dbl [24]> <Rprofmem [1 x 3]> <bench_tm [10,000]> <tibble [10,000 x 3]>
(I added `as.numeric` to each of them because some return ints and some return doubles, and I didn't want to `as.numeric` just one and bias the benchmark. It's not strictly required for any of them, but now we can be assured that they are all equal, since otherwise `bench::mark` would fail, complaining that the output differs.)
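As a side note on where the int/double split comes from (my reading of the code, not something the benchmark itself reports): the loop and `abind` variants inherit integer type from `A` and `B`, while `vapply` coerces each slice to the type of its `FUN.VALUE` template, `matrix(0, 2, 3)`, which is double:

```r
A <- matrix(1:8, 2, 4)
B <- matrix(11:22, 3, 4)

# Integer * integer stays integer, so the loop/outer/abind variants are integer...
typeof(outer(A[, 1], B[, 1]))
# [1] "integer"

# ...while vapply coerces results to match its double FUN.VALUE template.
typeof(vapply(1:4, function(ss) outer(A[, ss], B[, ss]), matrix(0, 2, 3)))
# [1] "double"
```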
I like @jay.sf's answer because it is both fast and does not require additional packages (`abind` is non-standard but handy to have nonetheless).
Let's grow the data a little.
Abig <- do.call(cbind, replicate(50, do.call(rbind, replicate(50, A, simplify = FALSE)), simplify = FALSE))
Bbig <- do.call(cbind, replicate(50, do.call(rbind, replicate(50, B, simplify = FALSE)), simplify = FALSE))
C1big <- C2big <- array(dim=c(dim(Abig)[1], dim(Bbig)))
dim(Abig)
# [1] 100 200
Now a different benchmark, this time on Windows 11 with R-4.1.2:
bench::mark(
  C1big = {
    for (ii in seq(nrow(Abig))) {
      for (jj in seq(nrow(Bbig))) {
        C1big[ii, jj, ] <- Abig[ii, ] * Bbig[jj, ]
      }
    }
    as.numeric(C1big)
  },
  C2big = {
    for (ss in seq(ncol(Abig))) {
      C2big[, , ss] <- outer(Abig[, ss], Bbig[, ss])
    }
    as.numeric(C2big)
  },
  C3big = as.numeric(do.call(abind::abind, c(lapply(seq(ncol(Abig)), function(ss) outer(Abig[, ss], Bbig[, ss])), along = 3))),
  C4big = as.numeric(vapply(seq(ncol(Abig)), function(ss) outer(Abig[, ss], Bbig[, ss]), matrix(0, nrow(Abig), nrow(Bbig)))),
  iterations = 30
)
# # A tibble: 4 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 C1big 100.7ms 135ms 7.38 83.5MB 2.68 22 8 2.98s <dbl [3,000,000]> <Rprofmem [75,001 x 3]> <bench_tm [30]> <tibble [30 x 3]>
# 2 C2big 22ms 24.4ms 36.0 46.8MB 7.20 25 5 694.7ms <dbl [3,000,000]> <Rprofmem [1,801 x 3]> <bench_tm [30]> <tibble [30 x 3]>
# 3 C3big 26.5ms 28.4ms 32.8 92.6MB 28.7 16 14 488.54ms <dbl [3,000,000]> <Rprofmem [1,632 x 3]> <bench_tm [30]> <tibble [30 x 3]>
# 4 C4big 10ms 17.4ms 61.2 46.7MB 18.6 23 7 375.9ms <dbl [3,000,000]> <Rprofmem [1,402 x 3]> <bench_tm [30]> <tibble [30 x 3]>
It seems like we're getting to where I would expect: @jay.sf's `vapply` implementation should do very well, and it appears to be breaking out, with `itr/sec` and `mem_alloc` being good indicators.
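The `vapply` idea generalizes to any pair of matrices with the same number of columns. A small sketch (the function name `rowwise_outer` is my own invention, not from @jay.sf's answer):

```r
# Stack outer(A[, s], B[, s]) over columns s into an
# nrow(A) x nrow(B) x ncol(A) array; vapply's FUN.VALUE
# template fixes the shape (and type) of each slice.
rowwise_outer <- function(A, B) {
  stopifnot(ncol(A) == ncol(B))
  vapply(seq_len(ncol(A)),
         function(ss) outer(A[, ss], B[, ss]),
         FUN.VALUE = matrix(0, nrow(A), nrow(B)))
}

A <- matrix(1:8, 2, 4)
B <- matrix(11:22, 3, 4)
C <- rowwise_outer(A, B)
dim(C)
# [1] 2 3 4
```

Because `vapply` already returns its results with the dimensions of `FUN.VALUE` plus one, no explicit `array()` reshaping is needed.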