This is something I find difficult to understand:
library(snow)  # makeCluster(..., "SOCK") and parLapply come from snow; the parallel package works too with type = "PSOCK"
cl = makeCluster(rep("localhost", 8), "SOCK")
# This will not work, error: dat not found in the nodes
pmult = function(cl, a, x)
{
    mult = function(s) s*x
    parLapply(cl, a, mult)
}
scalars = 1:4
dat = rnorm(4)
pmult(cl, scalars, dat)
# This will work
pmult = function(cl, a, x)
{
    x  # touching x forces its promise to be evaluated
    mult = function(s) s*x
    parLapply(cl, a, mult)
}
scalars = 1:4
dat = rnorm(4)
pmult(cl, scalars, dat)
# This will work
pmult = function(cl, a, x)
{
    mult = function(s, x) s*x
    parLapply(cl, a, mult, x)  # x is passed on to mult as an extra argument
}
scalars = 1:4
dat = rnorm(4)
pmult(cl, scalars, dat)
The first function doesn't work because of lazy evaluation of arguments. But what exactly is lazy evaluation? When mult() is executed, doesn't that require x to be evaluated? The second version works because it forces x to be evaluated. The strangest part is the third function: nothing changes except that mult() receives x as an extra argument, and suddenly everything works!
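To see lazy evaluation in isolation, here is a minimal sketch (no cluster involved; `show_lazy` is a made-up name). An argument is a promise that is only evaluated the first time the function body touches it, which is why the side effect of the argument expression fires between the two messages:

```r
show_lazy = function(x) {
    cat("entered the function\n")
    x   # the promise for x is evaluated here, not at call time
    cat("x has now been evaluated\n")
}

show_lazy({ cat("evaluating the argument\n"); 42 })
# entered the function
# evaluating the argument
# x has now been evaluated
```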
My other question is: what should I do if I don't want to define all the variables and functions inside the function that calls parLapply()? The following definitely will not work:
pmult = function(cl)
{
    source("a_x_mult.r")  # supposed to define a, mult, and x
    parLapply(cl, a, mult, x)
}
pmult(cl)
I can pass all these variables and functions as arguments:
f1 = function(i)
{
    return(rnorm(i))
}
f2 = function(y)
{
    return(f1(y)^2)
}
f3 = function(v)
{
    return(v - floor(v) + 100)
}
test = function(cl, f1, f2, f3)
{
    x = f2(15)
    parLapply(cl, x, f3)
}
test(cl, f1, f2, f3)
Or I can use clusterExport(), but it'll be cumbersome when there are lots of objects to be exported. Is there a better way?
To understand this, you have to realize that there is an environment associated with every function, and what that environment is depends on how the function was created. A function that is simply created in a script is associated with the global environment, but a function that is created by another function is associated with the local environment of the creating function. In your example, pmult creates mult, so the environment associated with mult contains the formal arguments cl, a, and x.
The problem with the first case is that parLapply doesn't know anything about x: it is just an unevaluated formal argument that is serialized as part of the environment of mult by parLapply. Since x isn't evaluated when mult is serialized and sent to the cluster workers, it causes an error when the workers execute mult, since dat isn't available in that context. In other words, by the time mult evaluates x, it's too late.
The second case works because x is evaluated before mult is serialized, so the actual value of x is serialized along with the environment of mult. It does what you would expect if you knew about closures but not lazy argument evaluation.
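The same "evaluate before the closure escapes" idea is what base R's force() is for. This sketch (no cluster needed; make_mult is a made-up name) shows that after force(), the evaluated value lives in the closure's environment rather than an unevaluated promise:

```r
make_mult = function(x) {
    force(x)            # evaluate the promise now, inside this environment
    function(s) s * x   # closure over the already-evaluated x
}

m = make_mult(10)
m(3)                                # 30
get("x", envir = environment(m))    # 10: the value itself is stored in m's environment
```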
The third case works because you're having parLapply handle x for you: it is passed to mult as an explicit additional argument, so there's no trickery going on at all.
I should warn you that in all of these cases, a is being evaluated (by parLapply) and serialized along with the environment of mult. parLapply is also splitting a into chunks and sending those chunks to each worker, so the copy of a in the environment of mult is completely unnecessary. It doesn't cause an error, but it could hurt performance, since mult is sent to the workers in every task object. Fortunately, this is much less of a problem with parLapply, since there is only one task per worker. It would be a much worse problem with clusterApply or clusterApplyLB, where the number of tasks is equal to the length of a.
I talk about a number of issues relating to functions and environments in the "snow" chapter of my book. There are some subtle issues involved, and it's easy to get burned, sometimes without realizing that it happened.
As for your second question, there are various strategies for exporting functions to the workers, but some people do use source to define functions on the workers rather than using clusterExport. Keep in mind that source has a local argument that controls where the parsed expressions are evaluated, and you may need to specify the absolute path to the script. Finally, if you're using remote cluster workers, you may need to scp the script to the workers if you don't have a distributed file system.
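A hedged sketch of that approach: clusterCall() can run source() on every worker, so the script's definitions land in each worker's global environment (source defaults to local = FALSE). Here a temporary script stands in for your a_x_mult.r, since the workers in this example run on the same machine:

```r
library(parallel)

# A throwaway script standing in for "a_x_mult.r"
scr = tempfile(fileext = ".R")
writeLines("mult = function(s, x) s * x", scr)

cl = makeCluster(2)
clusterCall(cl, source, scr)   # each worker sources the script into its global env

# mult is not defined on the master, but the workers find it in their global envs
res = parLapply(cl, 1:4, function(s) mult(s, 10))
stopCluster(cl)
unlist(res)  # 10 20 30 40
```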
Here is a simple method of exporting all of the functions in your global environment to the cluster workers:
ex <- Filter(function(x) is.function(get(x, .GlobalEnv)), ls(.GlobalEnv))
clusterExport(cl, ex)