I'm trying to speed up a workflow that involves multiplying rows from two data frames together via a custom function.
Right now I'm using apply() with a custom function. My understanding is that lapply() or sapply() would be faster (and would eventually allow parallelization, though I'd prefer a speedup that doesn't depend on parallel processing), but I can't figure out the lapply() or sapply() syntax I should use with my function; my best guess is shown after the MWE below. If there's an even simpler way to vectorize the calculation and avoid *apply() altogether, that would be preferred.
The number of rows in my use case will be 1 million or more and the number of columns around 15, but here's an MWE that illustrates the speed issue:
# Two data frames that will be used in the calculation. d2 can be a matrix, but d1 must be a data frame.
d1 <- data.frame(V1 = runif(1000), V2 = runif(1000), V3 = runif(1000), V4 = runif(1000))
d2 <- data.frame(Va = runif(3), V1 = runif(3), V2 = runif(3), V3 = runif(3), V4 = runif(3))
# Custom function that is applied to each row x of d1. Each row of d2 holds
# an intercept (column 1) plus slope coefficients; the function returns the
# multinomial-logit probability of the first outcome for that row of d1.
manualprob <- function(x){
  xb1 <- as.numeric(rowSums(d2[1, 2:ncol(d2)] * x) + d2[1, 1])
  xb2 <- as.numeric(rowSums(d2[2, 2:ncol(d2)] * x) + d2[2, 1])
  xb3 <- as.numeric(rowSums(d2[3, 2:ncol(d2)] * x) + d2[3, 1])
  denom <- 1 + exp(xb1) + exp(xb2) + exp(xb3)
  prob <- exp(xb1) / denom
  return(prob)
}
# apply() used below, but it is too slow
start_time <- proc.time()
d1$prob <- as.vector(apply(d1, 1, manualprob))
proc.time() - start_time
   user  system elapsed
  1.081   0.007   1.088
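For reference, here's my best guess at a sapply() version. It splits d1 into a list of one-row data frames first, and I suspect the split step alone would hurt at a million rows (a guess on my part, not a benchmark):
# Candidate sapply() pattern (my guess at the syntax, not a tested speedup):
# split d1 into one-row data frames, then apply manualprob to each
rows <- split(d1[, c("V1", "V2", "V3", "V4")], seq_len(nrow(d1)))
prob_sapply <- sapply(rows, manualprob)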
Your best bet is to convert to matrices and use R's very fast matrix operations...
You can create all of the xb values in one go (using d1 as it was before the prob column was appended, i.e. just the four V columns) with
xb <- as.matrix(d2[, -1]) %*% t(as.matrix(d1)) + d2[, 1]
This produces a 3 × 1000 matrix: one row per row of d2, one column per row of d1. Adding d2[, 1] works because R recycles the length-3 intercept vector down the columns, so each row i picks up its intercept d2[i, 1].
And then you can get the probabilities with
prob <- exp(xb[1, ]) / (1 + colSums(exp(xb)))
This all takes almost zero time on my machine!
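To make the whole thing reproducible end to end, something like the following should work (the names X, B and prob_vec are just for illustration; the all.equal() call checks the result against the row-wise apply() version):
# Vectorized version: one matrix product instead of one function call per row
X <- as.matrix(d1[, c("V1", "V2", "V3", "V4")])  # 1000 x 4 predictor matrix
B <- as.matrix(d2[, -1])                         # 3 x 4 slope coefficients
xb <- B %*% t(X) + d2[, 1]                       # intercepts recycled by row
prob_vec <- exp(xb[1, ]) / (1 + colSums(exp(xb)))
# Sanity check against the slower apply() result (strip names first)
all.equal(unname(prob_vec), d1$prob)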