I have a data.frame of cells, values and coordinates. It resides in the global environment.
> head(cont.values)
   cell value   x   y
1 11117    NA -34 322
2 11118    NA -30 322
3 11119    NA -26 322
4 11120    NA -22 322
5 11121    NA -18 322
6 11122    NA -14 322
Because my custom function takes almost a second to calculate an individual cell (and I have tens of thousands of cells to calculate), I don't want to duplicate calculations for cells that already have a value. My solution below tries to avoid that. Each cell can be calculated independently, screaming for parallel execution.
What my function actually does is check whether there's a value for a specified cell number; if it's NA, the function calculates the value and inserts it in place of the NA.
I can run my magic function (the result is the value for a corresponding cell) using the apply family of functions, and from within apply I can read and write cont.values without a problem (it's in the global environment).
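For illustration, here is a minimal sketch of that sequential approach; get.value() is just a placeholder for my actual slow function:
calc.cell <- function(cell) {
    i <- which(cont.values$cell == cell)
    if (is.na(cont.values$value[i]))
        cont.values$value[i] <<- get.value(cell)  # write the new value back to the global data.frame
    cont.values$value[i]
}
invisible(sapply(cont.values$cell, calc.cell))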
Now, I want to run this in parallel (using snowfall), and I'm unable to read or write from/to this variable from an individual worker (core).
Question: What solution would be able to read/write from/to a dynamic variable residing in the global environment from within a worker (core) when executing a function in parallel? Is there a better approach to doing this?
The pattern of a central store that workers consult for values is implemented in the rredis package on CRAN. The idea is that a Redis server maintains a store of key-value pairs (your global data frame, re-implemented). Workers query the server to see whether the value has already been calculated (redisGet) and if not, do the calculation and store it (redisSet) so that other workers can re-use it. Workers can be R scripts, so it's easy to expand the work force. It's a very nice alternative parallel paradigm. Here's an example that uses the notion of 'memoizing' each result. We have a function that is slow (it sleeps for a second):
fun <- function(x) { Sys.sleep(1); x }
We write a 'memoizer' that returns a variant of fun that first checks to see whether the value for x has already been calculated, and if so uses that:
memoize <-
    function(FUN)
{
    force(FUN)                 # circumvent lazy evaluation
    require(rredis)
    redisConnect()
    function(x)
    {
        key <- as.character(x)
        val <- redisGet(key)
        if (is.null(val)) {
            val <- FUN(x)
            redisSet(key, val)
        }
        val
    }
}
We then memoize our function
funmem <- memoize(fun)
and go
> system.time(res <- funmem(10)); res
   user  system elapsed
  0.003   0.000   1.082
[1] 10
> system.time(res <- funmem(10)); res
   user  system elapsed
  0.001   0.001   0.040
[1] 10
This does require a Redis server running outside R, but one is very easy to install; see the documentation that comes with the rredis package.
A within-R parallel version might be
library(snow)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
clusterEvalQ(cl, { require(rredis); redisConnect() })
tasks <- sample(1:5, 100, TRUE)
system.time(res <- parSapply(cl, tasks, funmem))
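A sketch (with assumptions beyond the example above) of how this might map back onto the original cell / value problem: memoize the slow per-cell calculation and key the Redis store on the cell number, so any worker can re-use a value another worker has already computed. Here get.value() is again a placeholder for the slow function and must be defined before memoize() is called.
cellmem <- memoize(get.value)                    # keys are the cell numbers
res <- parSapply(cl, cont.values$cell, cellmem)  # each cell computed (or fetched) once
cont.values$value <- res                         # collect results back on the master
stopCluster(cl)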