I've been trying to run Rmpi and snowfall on my university's clusters, but no matter how many compute nodes I get allocated, my snowfall initialization keeps running on only one node.
Here's how I'm initializing it:
sfInit(parallel=TRUE, cpus=10, type="MPI")
Any ideas? I'll provide clarification as needed.
To run an Rmpi-based program on a cluster, you need to request multiple nodes using your batch queueing system, and then execute your R script from the job script via a utility such as mpirun/mpiexec. Ideally, the mpirun utility has been built to automatically detect what nodes have been allocated by the batch queueing system. Otherwise, you will need to use an mpirun argument such as --hostfile to tell it what nodes to use.
In your case, it sounds like you requested multiple nodes, so the problem is probably with the way that the R script is executed. Some people don't realize that they need to use mpirun/mpiexec, and the result is that their script runs on a single node. If you are using mpirun, it may be that your installation of Open MPI wasn't built with support for your batch queueing system. In that case, you would have to create an appropriate hostfile from information supplied by your batch queueing system, which usually provides it via an environment variable and/or a file.
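As a rough sketch of that hostfile step, assuming a Torque-style node file that lists each hostname once per allocated core, something like the following could turn it into an Open MPI hostfile with slot counts. The make_hostfile name and the hosts.txt file name are made up for illustration, and other batch systems expose the node list differently.

```shell
#!/bin/sh
# Convert a Torque-style node file (one hostname per allocated core,
# so hostnames repeat) into an Open MPI hostfile with slot counts.
# make_hostfile is a hypothetical helper name, not part of any tool.
make_hostfile() {
    nodefile=$1
    # Count how often each hostname appears, then emit "host slots=N".
    sort "$nodefile" | uniq -c | awk '{ print $2 " slots=" $1 }'
}

# Usage inside a job script (assumes the scheduler sets PBS_NODEFILE):
#   make_hostfile "$PBS_NODEFILE" > hosts.txt
#   mpirun --hostfile hosts.txt -np 1 R --slave -f par.R
```

If your mpirun already detects the allocation by itself, none of this is needed.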
Here is a typical mpirun command that I use to execute my parallel R scripts from the job script:
mpirun -np 1 R --slave -f par.R
Since we build Open MPI with support for Torque, I don't use the --hostfile option: mpirun figures out what nodes to use from the PBS_NODEFILE environment variable automatically. The use of -np 1 may seem strange, but it is needed if your program is going to spawn workers, which is typically done when using the snow package. I've never used snowfall, but after looking over the source code, it appears to me that sfInit always calls makeMPIcluster with a "count" argument, which will cause snow to spawn workers, so I think that -np 1 is required for MPI clusters with snowfall. Otherwise, mpirun will start your R script on multiple nodes, and each copy will spawn 10 workers on its own node, which is not what you want. The trick is to set the sfInit "cpus" argument to a value that is consistent with the number of nodes allocated to your job by the batch queueing system. You may find the Rmpi mpi.universe.size function useful for that.
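If you would rather work out that number in the job script than in R, here is a minimal sketch, again assuming a Torque-style node file with one line per allocated slot. It is an alternative to mpi.universe.size, not something snowfall does for you. The worker_count name and the SNOWFALL_CPUS variable are hypothetical, and reserving one slot for the master R process is my own choice rather than a requirement.

```shell
#!/bin/sh
# Print a worker count for a Torque-style node file: total slots minus one,
# leaving a slot for the master R process started by "mpirun -np 1".
# worker_count is a hypothetical helper name.
worker_count() {
    # Count non-empty lines; each one is assumed to be one allocated slot.
    slots=$(grep -c . "$1")
    workers=$((slots - 1))
    # Never report fewer than one worker.
    if [ "$workers" -lt 1 ]; then
        workers=1
    fi
    echo "$workers"
}

# Usage inside a job script (SNOWFALL_CPUS is a made-up variable name):
#   SNOWFALL_CPUS=$(worker_count "$PBS_NODEFILE")
#   export SNOWFALL_CPUS
#   mpirun -np 1 R --slave -f par.R
```

The R script could then read the value with Sys.getenv("SNOWFALL_CPUS"), convert it with as.integer, and pass it as the "cpus" argument to sfInit. This assumes the master process inherits the job script's environment, which should hold when it starts on the node running the job script.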
If you think that all of this is done correctly, the problem may be with the way that the MPI cluster object is being created in your R script, but I suspect that it has to do with the use (or lack of use) of mpirun.