Tags: r, parallel-processing, snowfall

writing to global environment when running in parallel


I have a data.frame of cells, values and coordinates. It resides in the global environment.

> head(cont.values)
   cell value   x   y
1 11117    NA -34 322
2 11118    NA -30 322
3 11119    NA -26 322
4 11120    NA -22 322
5 11121    NA -18 322
6 11122    NA -14 322

Because my custom function takes almost a second to calculate an individual cell (and I have tens of thousands of cells to calculate), I don't want to duplicate calculations for cells that already have a value. The following solution tries to avoid that. Each cell can be calculated independently, which is screaming for parallel execution.

What my function actually does is check whether there's a value for a specified cell number; if it's NA, it calculates the value and inserts it in place of the NA.

I can run my magic function (the result is the value for the corresponding cell) using the apply family of functions, and from within apply I can read and write cont.values without a problem (it's in the global environment).
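
A stripped-down sketch of that serial approach might look like the following, where calc.cell is just a hypothetical stand-in for my actual slow function:

    # calc.cell is a hypothetical placeholder for the slow per-cell calculation
    calc.cell <- function(cell) { Sys.sleep(1); cell * 2 }

    # compute only the cells whose value is still NA and write each result
    # back into cont.values in the global environment
    invisible(sapply(which(is.na(cont.values$value)), function(i) {
        cont.values$value[i] <<- calc.cell(cont.values$cell[i])
    }))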

Now I want to run this in parallel (using snowfall), but I'm unable to read from or write to this variable from an individual core.

Question: What solution would allow reading from/writing to a dynamic variable residing in the global environment from within a worker (core) when executing a function in parallel? Is there a better approach to doing this?


Solution

  • The pattern of a central store that workers consult for values is implemented in the rredis package on CRAN. The idea is that the Redis server maintains a store of key-value pairs (your global data frame, re-implemented). Workers query the server to see whether the value has already been calculated (redisGet) and, if not, do the calculation and store it (redisSet) so that other workers can re-use it. Workers can be R scripts, so it's easy to expand the workforce. It's a very nice alternative parallel paradigm. Here's an example that uses the notion of 'memoizing' each result. We have a function that is slow (it sleeps for a second)

    fun <- function(x) { Sys.sleep(1); x }
    

    We write a 'memoizer' that returns a variant of fun that first checks to see if the value for x has already been calculated, and if so uses that

    memoize <-
        function(FUN)
    {
        force(FUN) # circumvent lazy evaluation
        require(rredis)
        redisConnect()
        function(x)
        {
            key <- as.character(x)
            val <- redisGet(key)  # look up a previously cached result
            if (is.null(val)) {   # not yet computed: do the work and cache it
                val <- FUN(x)
                redisSet(key, val)
            }
            val
        }
    }
    

    We then memoize our function

    funmem <- memoize(fun)
    

    and go

    > system.time(res <- funmem(10)); res
       user  system elapsed 
      0.003   0.000   1.082 
    [1] 10
    > system.time(res <- funmem(10)); res
       user  system elapsed 
      0.001   0.001   0.040 
    [1] 10
    

    This does require a Redis server running outside R, but it is very easy to install; see the documentation that comes with the rredis package.
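
    If the Redis server sits on another machine, each worker simply points its connection at that host; the hostname below is purely illustrative:

    # connect to a Redis server on a remote host (hostname is illustrative)
    redisConnect(host = "redis.example.org", port = 6379)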

    A within-R parallel version might be

    library(snow)
    cl <- makeCluster(c("localhost", "localhost"), type = "SOCK")
    clusterEvalQ(cl, { require(rredis); redisConnect() }) # connect each worker to Redis
    tasks <- sample(1:5, 100, TRUE)
    system.time(res <- parSapply(cl, tasks, funmem))
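
    When the tasks have finished you would typically close the workers' Redis connections and shut the cluster down, along the lines of

    clusterEvalQ(cl, redisClose())   # close each worker's Redis connection
    stopCluster(cl)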