Tags: r, parallel-processing, snowfall

writing to global environment when running in parallel


I have a data.frame of cells, values and coordinates. It resides in the global environment.

> head(cont.values)
   cell value   x   y
1 11117    NA -34 322
2 11118    NA -30 322
3 11119    NA -26 322
4 11120    NA -22 322
5 11121    NA -18 322
6 11122    NA -14 322

Because my custom function takes almost a second to calculate an individual cell (and I have tens of thousands of cells to calculate), I don't want to duplicate calculations for cells that already have a value. The following solution tries to avoid that. Each cell can be calculated independently, which is screaming for parallel execution.

What my function actually does is check whether there's a value for a specified cell number; if it's NA, it calculates the value and inserts it in place of the NA.

I can run my magic function (the result is the value for the corresponding cell) using the apply family of functions, and from within apply I can read and write cont.values without a problem (it's in the global environment).
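
A stripped-down sketch of that serial approach might look like the following, where calc.cell is just a hypothetical stand-in for my actual slow function:

    # calc.cell is a hypothetical placeholder for the slow per-cell calculation
    calc.cell <- function(cell) { Sys.sleep(1); cell * 2 }

    # compute only the cells whose value is still NA and write each result
    # back into cont.values in the global environment
    invisible(sapply(which(is.na(cont.values$value)), function(i) {
        cont.values$value[i] <<- calc.cell(cont.values$cell[i])
    }))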

Now I want to run this in parallel (using snowfall), but I'm unable to read from or write to this variable from an individual core.

Question: What solution would allow reading from/writing to a dynamic variable residing in the global environment from within a worker (core) when executing a function in parallel? Is there a better approach to doing this?


Solution

  • The pattern of a central store that workers consult for values is implemented in the rredis package on CRAN. The idea is that the Redis server maintains a store of key-value pairs (your global data frame, re-implemented). Workers query the server to see whether the value has already been calculated (redisGet) and, if not, do the calculation and store it (redisSet) so that other workers can re-use it. Workers can be R scripts, so it's easy to expand the workforce. It's a very nice alternative parallel paradigm. Here's an example that uses the notion of 'memoizing' each result. We have a function that is slow (it sleeps for a second)

    fun <- function(x) { Sys.sleep(1); x }
    

    We write a 'memoizer' that returns a variant of fun that first checks to see if the value for x has already been calculated, and if so uses that

    memoize <-
        function(FUN)
    {
        force(FUN) # circumvent lazy evaluation
        require(rredis)
        redisConnect()
        function(x)
        {
            key <- as.character(x)
            val <- redisGet(key)  # look up a previously cached result
            if (is.null(val)) {   # not yet computed: do the work and cache it
                val <- FUN(x)
                redisSet(key, val)
            }
            val
        }
    }
    

    We then memoize our function

    funmem <- memoize(fun)
    

    and go

    > system.time(res <- funmem(10)); res
       user  system elapsed 
      0.003   0.000   1.082 
    [1] 10
    > system.time(res <- funmem(10)); res
       user  system elapsed 
      0.001   0.001   0.040 
    [1] 10
    

    This does require a Redis server running outside R, but it is very easy to install; see the documentation that comes with the rredis package.
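
    If the Redis server sits on another machine, each worker simply points its connection at that host; the hostname below is purely illustrative:

    # connect to a Redis server on a remote host (hostname is illustrative)
    redisConnect(host = "redis.example.org", port = 6379)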

    A within-R parallel version might be

    library(snow)
    cl <- makeCluster(c("localhost", "localhost"), type = "SOCK")
    clusterEvalQ(cl, { require(rredis); redisConnect() }) # connect each worker to Redis
    tasks <- sample(1:5, 100, TRUE)
    system.time(res <- parSapply(cl, tasks, funmem))
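
    When the tasks have finished you would typically close the workers' Redis connections and shut the cluster down, along the lines of

    clusterEvalQ(cl, redisClose())   # close each worker's Redis connection
    stopCluster(cl)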