r parallel-processing conda cluster-computing

Problem running single-node job on computing cluster in conda environment for all processors in R

I am trying to run a single-node job on my university computing cluster, submitted via qsub, inside a conda environment "myenv". As soon as I start the parallelization in my R script, the worker processes start in the base environment instead of "myenv". A test parallelization works perfectly in the base environment, so the issue is getting "myenv" activated for all workers. "myenv" itself works fine for the first (master) process.

I submit the following bash script via qsub (1 node with 20 processors):

source /mypath/conda.sh
conda activate myenv
Rscript myscript.R

I am using a conda environment with the latest R version and specific packages. The base environment on the cluster has an extremely outdated R version that is incompatible with the packages I need (e.g. raster). According to IT, updating the base environment on the cluster is not possible.

This is my R-script:

require(raster)
require(doSNOW)

r1 <- raster("raster1.tif") # works perfectly (conda environment is activated here)

nc = 19
cl = makeSOCKcluster(nc)
registerDoSNOW(cl)

foreach(f=1:19)  %dopar% {

  require(raster) # does not work (conda environment is not activated here)
  # (error: there is no package called raster)

  r2 <- raster("raster2.tif")

}

I have tried (none of them worked for me):

foreach(f=1:19, .packages = c("raster")) %dopar% {...}
loading the conda environment with a system call (via system2 in R) inside the loop
the future package for a parallel for-loop via %dofuture%
running the job in the old base-R installation and loading packages from a package folder via .libPaths("mypath"): works with some packages, but not other packages which are incompatible with the old base-R version

Could anyone help, please? Many thanks.

Solution

Try the following shell script, which uses MPI. It is more or less equivalent to your socket cluster, except that it uses MPI, which is more common on HPC clusters. It simply runs 19 copies of your code. The module names and requirements may differ on your cluster. You will also need to run install.packages("pbdMPI") in an R session on a login node before submitting the job; do this with "myenv" activated, so that the package ends up in the library of the R installation that the job actually uses.

#!/bin/bash
#PBS <your-pbs-requests>
#PBS ...

module load conda
module load OpenMPI
module load r

source /mypath/conda.sh
conda activate myenv
mpirun -n 19 Rscript myscript.R

And the R script, myscript.R:

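library(raster)
library(pbdMPI)

# rank of this copy of the script; ranks start at 0
rank <- comm.rank()

# you probably need to form file names from ranks
filename <- paste0("raster", rank, ".tif")
r <- raster(filename)

# rest of your code

# shut down MPI cleanly at the end of the script
finalize()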

The shell script runs 19 copies of the R code, where the instance ranks will be 0 to 18, each reading a different file. The script will work for any number of instances, as long as you request the resources in the shell script.
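If the number of files does not match the number of instances, or the files are numbered from 1 while ranks start at 0, each instance can work out its own share of the files from its rank. The following is a minimal base-R sketch of that idea, not part of pbdMPI: the helper name tasks_for_rank and the file count of 40 are hypothetical, and rank and size are hard-coded here so the snippet runs on its own. In the real script they would come from comm.rank() and comm.size().

```r
# Hypothetical helper: indices of the tasks (numbered from 1) that a
# given rank should process, assigned round-robin across all ranks.
tasks_for_rank <- function(n_tasks, rank, size) {
  idx <- seq_len(n_tasks)
  idx[(idx - 1L) %% size == rank]
}

# Assumed values for illustration only; in an MPI job use
# rank <- comm.rank() and size <- comm.size() instead.
n_files <- 40L
size <- 19L
rank <- 0L

my_tasks <- tasks_for_rank(n_files, rank, size)
my_files <- paste0("raster", my_tasks, ".tif")
print(my_files)
```

Every file index is assigned to exactly one rank, so the same script keeps working if you later request more or fewer instances.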

This simplifies your R code. If you need to combine the results of the 19 copies, consider allreduce() and allgather() in pbdMPI. This approach is called single-program multiple-data (SPMD), which is more general and extensible than the manager-workers style of doSNOW.
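As a rough sketch of how the results could be combined, the snippet below uses allgather() to collect one value from every instance and allreduce() to sum a value across instances. It assumes pbdMPI is installed in "myenv" and that the script is launched through mpirun as above. The gathered value (each instance's R.home()) is only an illustrative choice, picked because it also lets you check that every instance is running the R installation from the conda environment.

```r
library(pbdMPI)
init()  # explicit MPI initialisation

rank <- comm.rank()
n_ranks <- comm.size()

# Collect each instance's R home directory on all instances.
# If everything runs inside the conda environment, there should be
# a single unique path, pointing into that environment.
all_homes <- allgather(R.home())
comm.print(unique(unlist(all_homes)))  # printed by rank 0 only

# Sum a per-instance value across all instances (here simply the rank;
# in practice this could be a statistic computed from each raster).
total <- allreduce(rank, op = "sum")
comm.print(total)

finalize()  # shut down MPI cleanly
```

Both calls are collective, so every instance has to reach them; placing them inside a branch that only some ranks execute would make the job hang.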