I am applying 100 different model simulations to 83 data points and examining the range of estimates each simulation produces for each data point.
Each estimate is the sum of 342 variables each multiplied by its own coefficient, plus an intercept. The code below calculates the values correctly, but it is heinously slow. Is there any way to improve processing speed?
spec = read.csv(spectra)        # one row per data point: col 1 = filename, cols 2:341 = variables
coef = read.csv(coefficients)   # one row per simulation: col 2 = intercept, cols 3:342 = coefficients

# Pre-allocate the output: one row per data point, one column per simulation
shell = matrix(data = NA, ncol = 101, nrow = nrow(spec))
shell = as.data.frame(shell)
heading = paste("Sim_", seq(1, 100, 1), sep = "")
names(shell)[1] = "Filename"
names(shell)[2:101] = heading
shell[1] = spec[1]

# For each data point and each simulation, take the dot product of the
# variables with the coefficients, then add the intercept
for (i in 1:nrow(spec)) {
  for (j in 1:100) {
    shell[i, j+1] = sum(spec[i, 2:341] * coef[j, 3:342]) + coef[j, 2]
  }
}
Really you are performing matrix multiplication of spec with the transpose of coef, and then adding a constant (the intercept) to each column. You should get a speedup by using the built-in matrix multiplication operator %*% and vectorized operations to shift the columns:
out <- cbind(spec[,1], t(t(spec[,2:341] %*% t(coef[1:100,3:342])) + coef[1:100,2]))
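Two side notes, as a sketch rather than part of the benchmark below: read.csv returns data frames, and %*% only accepts numeric matrices, so if your inputs come straight from read.csv the numeric blocks need an as.matrix() coercion first. The double transpose can also be replaced by base R's sweep(), which adds the j-th intercept to the j-th column of the product (assuming, as in your code, that column 1 of spec is a filename and column 2 of coef is the intercept):

# Coerce the numeric blocks to matrices so %*% works on data-frame input
X <- as.matrix(spec[, 2:341])        # data points x variables
B <- as.matrix(coef[1:100, 3:342])   # simulations x coefficients
# sweep() adds coef[j,2] to column j of the product, avoiding t(t(...))
out <- cbind(spec[, 1], sweep(X %*% t(B), 2, coef[1:100, 2], "+"))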
The output of this one-liner is identical to the output of the code in the original post (modified slightly to take and return matrices and not to set any names):
OP <- function(spec, coef) {
shell = matrix(data=NA,ncol=101,nrow=nrow(spec))
shell[,1] <- spec[,1]
for (i in 1:nrow(spec)) {
for (j in 1:100) {
shell[i,j+1] = sum(spec[i,2:341]*coef[j,3:342]) + coef[j,2]
}
}
shell
}
all.equal(out, OP(spec, coef))
# [1] TRUE
In terms of runtimes, the vectorized operations yield a significant benefit (roughly 38x), even for this relatively small example (1000 rows in spec):
system.time(cbind(spec[,1], t(t(spec[,2:341] %*% t(coef[1:100,3:342])) + coef[1:100,2])))
# user system elapsed
# 0.028 0.001 0.030
system.time(OP(spec, coef))
# user system elapsed
# 0.927 0.224 1.161
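As a further tweak, not benchmarked here and usually a modest gain, base R's tcrossprod() computes X %*% t(B) in a single call without materializing the transpose:

# tcrossprod(X, B) is equivalent to X %*% t(B) but skips forming t(B)
out2 <- cbind(spec[, 1],
              sweep(tcrossprod(spec[, 2:341], coef[1:100, 3:342]),
                    2, coef[1:100, 2], "+"))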
Data:
set.seed(144)
spec <- matrix(rnorm(1000*341), nrow=1000)
coef <- matrix(rnorm(100*342), nrow=100)