I have a strong use case for parallelizing a flavor of the SGD algorithm. In this use case I need to apply the delta gradient update to the matrices P and Q for a random batch of samples. Each process will update mutually exclusive indices of both matrices.
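Concretely, one SGD step on a batch of (i, j, x) samples would look roughly like the sketch below (P, Q, lr, reg and the batch layout are placeholders, not my actual code). Each sample touches only row i of P and row j of Q, which is why mutually exclusive indices per process are safe.
k <- 10
P <- matrix(rnorm(100 * k), 100, k)   # first factor matrix (placeholder)
Q <- matrix(rnorm(100 * k), 100, k)   # second factor matrix (placeholder)
lr <- 0.01; reg <- 0.02               # learning rate and regularization (placeholders)
batch <- data.frame(i = sample(100, 5), j = sample(100, 5), x = rnorm(5))
for (s in seq_len(nrow(batch))) {
  i <- batch$i[s]; j <- batch$j[s]
  p_i <- P[i, ]; q_j <- Q[j, ]
  err <- batch$x[s] - sum(p_i * q_j)  # delta for this sample
  P[i, ] <- p_i + lr * (err * q_j - reg * p_i)
  Q[j, ] <- q_j + lr * (err * p_i - reg * q_j)
}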
A simple illustration of what I intend to do would be something like this:
# create "big" matrix
A <- matrix(rnorm(10000), 100, 100)
system.time(
# update each row vector independently using all my cores
r <- mclapply(1:100, mc.cores = 6, function(i) {
# updating ...
A[i,] <- A[i,] - 0.01
# return something, i.e. here I'd return the RMSE of this batch instead
sqrt(sum(A[i,]^2))
})
)
Are there any drawbacks to this approach? Are there more R-idiomatic alternatives? For example, to keep the computation clean (no side effects, immutable data), I could return the updated row A[i, ] - 0.01 instead of the RMSE, but that would be more complex to program, would drive up peak memory usage, and could even run out of memory.
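A sketch of what I mean by the immutable variant (rebuilding the matrix from the returned rows is my assumption of how it would have to be done):
rows  <- mclapply(1:100, mc.cores = 6, function(i) A[i, ] - 0.01)  # return updated rows
A_new <- do.call(rbind, rows)      # rebuild the whole matrix in the parent process
r     <- sqrt(rowSums(A_new^2))    # same per-row quantity as before
Here the parent has to hold A, all the returned rows and A_new at the same time, which is where the memory pressure comes from.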
Here is a reimplementation of your code that works on blocks of indices and uses shared data, with package {bigstatsr}:
N <- 10e3
A <- matrix(rnorm(N * N), N)

library(bigstatsr)
bigA <- as_FBM(A)

library(doParallel)
registerDoParallel(cl <- makeCluster(4))
system.time(
  r <- foreach(i = seq_len(N), .combine = 'c') %dopar% {
    # updating ...
    A[i, ] <- A[i, ] - 0.01
    # return something, i.e. here I'd return the RMSE of this batch instead
    sqrt(sum(A[i, ]^2))
  }
) # 11 sec
stopCluster(cl)
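Note that in this first version every worker gets its own copy of A (roughly 800 MB each for N = 10e3), and the row updates are lost when the workers exit since only the scalar is returned; with the FBM below, all processes read and write the same on-disk matrix, so the updates are actually shared.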
registerDoParallel(cl <- makeCluster(4))
system.time(
  r2 <- big_apply(bigA, a.FUN = function(X, ind) {
    # updating the shared FBM (X is the FBM passed to a.FUN) ...
    tmp <- X[ind, ] <- X[ind, ] - 0.01
    # return something, i.e. here I'd return the RMSE of this batch instead
    sqrt(rowSums(tmp^2))
  }, a.combine = 'c', ind = rows_along(bigA))
) # 1 sec
stopCluster(cl)
all.equal(r, r2) # TRUE
Again, it would be better to update columns instead of rows: an FBM, like a standard R matrix, is stored column-wise, so accessing whole columns is much more efficient than accessing rows (see the sketch below).
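For completeness, a minimal sketch of that column-wise variant (the default ind of big_apply already iterates over blocks of columns); note that it returns per-column instead of per-row values, and that it would subtract another 0.01 if run on the bigA already updated above:
r3 <- big_apply(bigA, a.FUN = function(X, ind) {
  # update a block of columns of the shared FBM
  tmp <- X[, ind] <- X[, ind] - 0.01
  # per-column equivalent of the quantity returned above
  sqrt(colSums(tmp^2))
}, a.combine = 'c')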