I ran the script below to set up RStudio Server via Slurm (it is currently running):
#!/bin/bash
#SBATCH --job-name=nodes
#SBATCH --output=a.log
#SBATCH --ntasks=18
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=7gb
date;hostname;pwd
module load R/4.2
rserver  # launches RStudio Server
This requests 18 tasks with 8 cores each, i.e. 144 cores in total.
However, when I check the number of cores available for parallel processing in the R console, it says 32 instead.
Here's the code for checking.
library(doParallel)
detectCores() # 32
Even worse, another package, parallelly (or future), which does take the scheduler settings into account, reports differently.
From the parallelly documentation:
For instance, if compute cluster schedulers are used (e.g. TORQUE/PBS and Slurm), they set specific environment variable specifying the number of cores that was allotted to any given job; availableCores() acknowledges these as well.
library(parallelly)
availableCores() # 8
I am wondering whether the current R session is actually running with the above scheduler specification (144 cores), and whether I am missing something important.
Also, could you recommend how to check the resources (cores / memory) that are allocated to, and usable by, R under a Slurm setup?
Thank you very much in advance.
Author of the Futureverse here, including the parallelly and future packages.
When you use:
#SBATCH --ntasks=18
#SBATCH --cpus-per-task=8
Slurm will give you 18 parallel tasks, each allowed up to 8 CPU cores. With no further specifications, these 18 tasks may be allocated on a single host or across 18 hosts.
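As a sketch (not part of the original answer): Slurm exposes this allocation to the job through standard environment variables, which you can inspect directly from R. The values shown in the comments are what the 18-task, 8-cores-per-task job above would see.

```r
## Sketch: inspect the Slurm allocation from inside the job
Sys.getenv("SLURM_NTASKS")         # "18"  -- number of parallel tasks
Sys.getenv("SLURM_CPUS_PER_TASK")  # "8"   -- CPU cores per task
Sys.getenv("SLURM_JOB_NODELIST")   # host(s) allocated, e.g. "c4-n[37-54]"
```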
First, parallel::detectCores() completely ignores what Slurm gives you. It reports on the number of CPU cores on the current machine's hardware. This will vary depending on which machine your main job script ends up running on. So, you don't want to use that. See https://www.jottr.org/2022/12/05/avoid-detectcores/ for more details on why detectCores() is not a good idea.
Second, parallelly::availableCores() respects what Slurm gives you. However, per design, it only reports on the number of CPU cores available on the current machine and to the current process (here, your main job process). Your main job process is only one (1) of the 18 tasks you requested. So, you don't want to use that either, unless you explicitly specify --ntasks=1 or --nodes=1.
Instead, you want to look at parallelly::availableWorkers(). It will report on what machines Slurm has allocated to your job and how many CPUs you were given on each of those machines. The length of this character vector will be the total number of parallel tasks Slurm has given you.
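To make the contrast concrete, here's a hedged sketch of what each call would report under the job specification in the question (the exact values depend on your hardware and allocation; 32, 8, and 144 match the numbers discussed above):

```r
## Sketch: the three functions report three different things
parallel::detectCores()               # hardware cores on the current host (e.g. 32)
parallelly::availableCores()          # cores for this one task/process: 8
workers <- parallelly::availableWorkers()
length(workers)                       # total parallel tasks across all hosts: 144
```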
Next, R will not automagically run in parallel. You need to set up a parallel cluster and work with that. So, after you launch R (in your case via RStudio), you can use:
library(future)
plan(cluster) ## defaults to plan(cluster, workers = availableWorkers())
and then you'll have nbrOfWorkers() parallel workers to play with when you use the future framework for parallelization, e.g.
library(future.apply)
y <- future_lapply(X, FUN = slow_fcn)
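Here is a self-contained variant of that pattern; the toy slow_fcn() and the local two-worker multisession plan are illustrative stand-ins (not from the answer) so it runs on a single machine:

```r
library(future)
library(future.apply)

## Two local background R sessions stand in for the Slurm cluster here
plan(multisession, workers = 2)

slow_fcn <- function(x) {
  Sys.sleep(0.1)  # simulate slow work
  x^2
}

y <- future_lapply(1:4, FUN = slow_fcn)
unlist(y)  # 1 4 9 16
```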
Warning: R itself has a limit of a maximum of 125 parallel workers, and in practice fewer. See parallelly::availableConnections() for details. So, you need to lower the total number of parallel workers from your currently requested 144, e.g. use --ntasks=14 and --cpus-per-task=8 (= 112 parallel workers).
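To see the connection budget on your own system before picking --ntasks and --cpus-per-task, you can query it directly (a sketch; the exact numbers depend on how your R was built):

```r
## Sketch: R's connection table caps how many cluster workers you can run
parallelly::availableConnections()  # total connections this R supports (typically 128)
parallelly::freeConnections()       # connections still free for parallel workers
```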
Here's a Slurm job script r-multihost.sh that launches an R script illustrating how availableWorkers() works:
#! /usr/bin/bash -l
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=8
echo "Started on: $(date --rfc-3339=seconds)"
echo "Hostname: $(hostname)"
echo "Working directory: $PWD"
## Run a small test R script using parallel workers
Rscript r-multihost.R
echo "Finished on: $(date --rfc-3339=seconds)"
Here's the R script r-multihost.R called by the above job script:
library(future)
library(future.apply)
message(sprintf("Running R v%s", getRversion()))
ncores <- parallelly::availableCores()
message(sprintf("Number of CPU cores available on the current machine: %d", ncores))
workers <- parallelly::availableWorkers()
message(sprintf("Possible set of parallel workers: [n=%d] %s", length(workers), paste(workers, collapse = ", ")))
## Set up a cluster of parallel workers
t0 <- Sys.time()
message(sprintf("Setting up %d parallel workers ...", length(workers)), appendLF = FALSE)
plan(cluster, workers = workers)
message(sprintf("done [%.1fs]", difftime(Sys.time(), t0, units = "secs")))
message(sprintf("Number of parallel workers: %d", nbrOfWorkers()))
## Ask all parallel workers to respond with some info
info <- future_lapply(seq_len(nbrOfWorkers()), FUN = function(idx) {
data.frame(idx = idx, hostname = Sys.info()[["nodename"]], pid = Sys.getpid())
})
info <- do.call(rbind, info)
print(info)
print(sessionInfo())
When submitting this as sbatch r-multihost.sh, you'd get something like:
Started on: 2023-04-03 12:32:31-07:00
Hostname: c4-n37
Working directory: /home/alice/r-parallel-example
Running R v4.2.2
Number of CPU cores available on the current machine: 8
Possible set of parallel workers: [n=16] c4-n37, c4-n37, c4-n37, c4-n37, c4-n37, c4-n37, c4-n37, c4-n37, c4-n38, c4-n38, c4-n38, c4-n38, c4-n38, c4-n38, c4-n38, c4-n38
Setting up 16 parallel workers ...done [50.2s]
Number of parallel workers: 16
idx hostname pid
1 1 c4-n37 45529
2 2 c4-n37 45556
3 3 c4-n37 45583
4 4 c4-n37 45610
5 5 c4-n37 45638
6 6 c4-n37 45665
7 7 c4-n37 45692
8 8 c4-n37 45719
9 9 c4-n38 99981
10 10 c4-n38 100164
11 11 c4-n38 100343
12 12 c4-n38 100521
13 13 c4-n38 100699
14 14 c4-n38 100880
15 15 c4-n38 101058
16 16 c4-n38 101236
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /software/R/lib64/R/lib/libRblas.so
LAPACK: /software/R/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] future.apply_1.10.0 future_1.32.0
loaded via a namespace (and not attached):
[1] compiler_4.2.2 parallelly_1.35.0 parallel_4.2.2 tools_4.2.2
[5] listenv_0.9.0 rappdirs_0.3.3 codetools_0.2-19 digest_0.6.31
[9] globals_0.16.2
Finished on: 2023-04-03 12:33:30-07:00