Tags: r, parallel-processing, snow

How to make an object available to the nodes with the snow package for R parallel computing


This is something I find difficult to understand:

library(snow)
cl = makeCluster(rep("localhost", 8), "SOCK")

# This will not work; error on the nodes: object 'dat' not found
pmult = function(cl, a, x)
{
    mult = function(s) s*x
    parLapply(cl, a, mult)
}
scalars = 1:4
dat = rnorm(4)
pmult(cl, scalars, dat)

# This will work
pmult = function(cl, a, x)
{
    x
    mult = function(s) s*x
    parLapply(cl, a, mult)
}
scalars = 1:4
dat = rnorm(4)
pmult(cl, scalars, dat)

# This will work
pmult = function(cl, a, x)
{
    mult = function(s, x) s*x
    parLapply(cl, a, mult, x)
}
scalars = 1:4
dat = rnorm(4)
pmult(cl, scalars, dat)

The first function doesn't work because of lazy evaluation of arguments. But what is lazy evaluation? When mult() is executed, doesn't it require x to be evaluated? The second one works because it forces x to be evaluated. The strangest thing happens in the third function: nothing changes except that mult() receives x as an extra argument, and suddenly everything works!
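
For reference, here is a minimal sketch of lazy evaluation in isolation (make_mult is a toy helper, not part of the code above): an argument in R is a promise that is only evaluated the first time it is used, so the stop() below doesn't fire when the closure is created, only when the closure actually touches x.

make_mult = function(x)
{
    # x is an unevaluated promise here; creating the inner function doesn't force it
    function(s) s * x
}
f = make_mult(stop("x was evaluated!"))  # no error yet: the promise is untouched
f(2)  # Error: x was evaluated! -- x is forced only on first use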

Another question: what should I do if I don't want to define all the variables and functions inside the function that calls parLapply()? The following definitely will not work:

pmult = function(cl)
{
    source("a_x_mult.r")  # suppose this script defines a, x, and mult
    parLapply(cl, a, mult, x)
}
pmult(cl)

I can pass all these variables and functions as arguments:

f1 = function(i)
{
    return(rnorm(i))
}

f2 = function(y)
{
    return(f1(y)^2)
}

f3 = function(v)
{
    return(v- floor(v) + 100)
}

test = function(cl, f1, f2, f3)
{
    x = f2(15)
    parLapply(cl, x, f3)
}

test(cl, f1, f2, f3)

Or I can use clusterExport(), but it'll be cumbersome when there are lots of objects to be exported. Is there a better way?



Solution

  • To understand this, you have to realize that there is an environment associated with every function, and what that environment is depends on how the function was created. A function that is simply created in a script is associated with the global environment, but a function that is created by another function is associated with the local environment of the creating function. In your example, pmult creates mult, so the environment associated with mult contains the formal arguments cl, a, and x.

    The problem with the first case is that parLapply doesn't know anything about x: it is just an unevaluated formal argument that gets serialized as part of the environment of mult. Since x isn't evaluated when mult is serialized and sent to the cluster workers, it causes an error when the workers execute mult, since dat isn't available in that context. In other words, by the time mult evaluates x, it's too late.
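
    You can check the premise directly: the workers share no variables with the master unless something explicitly ships them over. A quick sanity check (a sketch using snow's clusterEvalQ):

    dat = rnorm(4)
    clusterEvalQ(cl, exists("dat"))  # a list of FALSEs, one per worker: dat was never sent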

    The second case works because x is evaluated before mult is serialized, so the actual value of x is serialized along with the environment of mult. It does what you would expect if you knew about closures but not lazy argument evaluation.
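
    A more explicit way to write the second version is with force(), which exists for exactly this purpose; this is the question's second function with the bare x replaced (a sketch, behaviorally identical):

    pmult = function(cl, a, x)
    {
        force(x)  # evaluate the promise now, so the value of x is serialized with mult
        mult = function(s) s*x
        parLapply(cl, a, mult)
    }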

    The third case works because you're having parLapply handle x for you: the extra arguments to parLapply are evaluated on the master and sent to the workers along with each task, so x arrives as plain data rather than as an unevaluated promise. There's no trickery going on at all.

    I should warn you that in all of these cases, a is being evaluated (by parLapply) and serialized along with the environment of mult. parLapply is also splitting a into chunks and sending those chunks to each worker, so the copy of a in the environment of mult is completely unnecessary. It doesn't cause an error, but it could hurt performance, since mult is sent to the workers in every task object. Fortunately, this is much less of a problem with parLapply, since there is only one task per worker. It would be a much worse problem with clusterApply or clusterApplyLB where the number of tasks is equal to the length of a.
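
    If that overhead matters, one way to avoid it (a sketch, not the only option) is to reset the environment of mult so that pmult's locals, including a, aren't serialized with it; x must then be passed as an explicit argument, as in the third version:

    pmult = function(cl, a, x)
    {
        mult = function(s, x) s*x
        environment(mult) = globalenv()  # drop pmult's frame (and its copy of a) from mult
        parLapply(cl, a, mult, x)
    }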

    I talk about a number of issues relating to functions and environments in the "snow" chapter of my book. There are some subtle issues involved, and it's easy to get burned, sometimes without realizing that it happened.

    As for your second question, there are various strategies for exporting functions to the workers, but some people do use source to define functions on the workers rather than using clusterExport. Keep in mind that source has a local argument that controls where the parsed expressions are evaluated, and you may need to specify the absolute path to the script. Finally, if you're using remote cluster workers, you may need to scp the script to the workers if you don't have a distributed file system.
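
    For example (a sketch; the path /path/to/a_x_mult.r is hypothetical and must be valid on every worker):

    # source() runs on each worker; with the default local = FALSE, the objects
    # defined in the script land in that worker's global environment
    clusterEvalQ(cl, source("/path/to/a_x_mult.r"))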

    Here is a simple method of exporting all of the functions in your global environment to the cluster workers:

    # collect the names of everything in the global environment that is a function
    ex <- Filter(function(x) is.function(get(x, .GlobalEnv)), ls(.GlobalEnv))
    # copy each of those functions to the global environment of every worker
    clusterExport(cl, ex)
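
    With the functions exported, a function that runs on the workers can call the others by name. For instance, using f1 from the question (f4 is a made-up example):

    f4 = function(v) f1(1) + v  # f4 calls f1, which clusterExport copied to the workers
    parLapply(cl, 1:4, f4)      # parLapply ships f4 itself; f1 resolves on the workers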