I regularly use the parallel and pbapply packages. However, I have come across some odd behavior that I assume is by design, but I can't figure out how to work around it: if I use a cluster inside a function, the function's entire environment gets exported to each worker, regardless of whether I ask for anything to be exported. Below is a trivial, meaningless example, but it illustrates the point. Inside the function I create a matrix that is roughly 800 MB in size. I never ask for it to be exported, yet as soon as the workers start, each one immediately grows to about 800 MB. Is there some way to stop this implicit export of x from happening?
library(parallel)
library(pbapply)

f = function()
{
  # Large object that should stay on the master: 1e8 doubles, ~800 MB
  x = matrix(runif(10000*10000), nrow = 10000)
  cl = makeCluster(10)
  ans = pbsapply(1:1000, function(i){
    w = matrix(runif(1000*1000), nrow = 1000)
    return(sum(w))
  }, cl = cl)
  stopCluster(cl)
  return(ans)
}
makeCluster calls either makeSOCKcluster or makeFORKcluster, depending on snow::getClusterOption("type"). With a 'FORK' cluster, the workers inherit the current environment, so x comes along for free. Either restructure your code so that such large objects are not sitting in the environment when the cluster is used, or use makeSOCKcluster and export only what you need with clusterExport.
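To make the second option concrete, here is a sketch of the explicit-export approach, reusing the names from the question. It assumes a socket ("PSOCK") cluster. One caveat worth noting: even on a socket cluster, a closure is serialized together with its enclosing environment, so a worker function defined inside f() can still drag x along; resetting its environment to globalenv() (it only uses its own arguments and base functions) avoids that.

```r
library(parallel)
library(pbapply)

f <- function() {
  # Large object that should never leave the master process
  x <- matrix(runif(10000 * 10000), nrow = 10000)

  # Socket cluster: workers start with empty environments; only objects
  # sent via clusterExport() (none here) are copied over.
  cl <- makeCluster(10, type = "PSOCK")
  on.exit(stopCluster(cl))

  # Detach the worker function from f()'s frame so serializing it
  # does not also serialize x.
  g <- function(i) {
    w <- matrix(runif(1000 * 1000), nrow = 1000)
    sum(w)
  }
  environment(g) <- globalenv()

  pbsapply(1:1000, g, cl = cl)
}
```

If the workers did need some small object from f()'s frame, you would send just that one piece with clusterExport(cl, "obj", envir = environment()) rather than letting the whole frame travel implicitly.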