Tags: r, parallel-processing, future, slurm

Checking available cores in R on SLURM


I ran the script below for a SLURM RStudio setup (currently running):

#!/bin/bash
#SBATCH --job-name=nodes
#SBATCH --output=a.log
#SBATCH --ntasks=18
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=7gb


date;hostname;pwd

module load R/4.2
rserver                      # launches the RStudio server

This should give me 18 tasks with 8 cores each (144 cores total).

However, when I check the number of cores available for parallel processing in the R console, it says 32 instead.

Here's the code for checking.

library(doParallel)
detectCores() # 32

Even worse, with another package, parallelly (or future) that considers the scheduler setting, it reports differently.

From the parallelly package documentation:

For instance, if compute cluster schedulers are used (e.g. TORQUE/PBS and Slurm), they set specific environment variables specifying the number of cores allotted to any given job; availableCores() acknowledges these as well.

library(parallelly)
availableCores() # 8

I am wondering whether my R session actually has access to the scheduler allocation above (144 cores), or if I am missing something important.

Also, could you recommend how to check the resources (cores / memory) that are allocated to, and usable by, R under a Slurm setting?

Thank you very much in advance.


Solution

  • Author of the Futureverse here, including the parallelly and future packages.

    When you use:

    #SBATCH --ntasks=18
    #SBATCH --cpus-per-task=8
    

    Slurm will give you 18 parallel tasks, each allowed up to 8 CPU cores. With no further specifications, these 18 tasks may be allocated on a single host or across 18 hosts.

    First, parallel::detectCores() completely ignores what Slurm gives you. It reports on the number of CPU cores on the current machine's hardware. This will vary depending on which machine your main job script ends up running on. So, you don't want to use that. See https://www.jottr.org/2022/12/05/avoid-detectcores/ for more details on why detectCores() is not a good idea.
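    A quick way to see this difference interactively is to compare the two functions against the Slurm environment variables that parallelly consults (the variable names below are the standard ones Slurm sets; the exact values depend on your job request):

    ```r
    ## Standard Slurm job environment variables, set by the scheduler
    Sys.getenv(c("SLURM_JOB_ID", "SLURM_NTASKS",
                 "SLURM_CPUS_PER_TASK", "SLURM_MEM_PER_CPU"))

    parallel::detectCores()       ## hardware CPUs on this node; ignores Slurm
    parallelly::availableCores()  ## respects SLURM_CPUS_PER_TASK, e.g. 8 here
    ```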

    Second, parallelly::availableCores() respects what Slurm gives you. However, per design, it only reports on the number of CPU cores available on the current machine and to the current process (here, your main job process). Your main job process is only one (1) of the 18 tasks you requested. So, you don't want to use that either, unless you explicitly specify --ntasks=1 or --nodes=1.

    Instead, you want to look at parallelly::availableWorkers(). It will report on what machines Slurm has allocated to your job and how many CPUs you were given on each of those machines. The length of this character vector will be the total number of parallel tasks Slurm has given you.
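    As a sketch, a quick interactive check of the full allocation could look like:

    ```r
    library(parallelly)

    workers <- availableWorkers()  ## one entry per allotted task, named by host
    length(workers)                ## total number of parallel tasks
    table(workers)                 ## how many tasks landed on each machine
    ```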

    Next, R will not automagically run in parallel. You need to set up a parallel cluster and work with that. So, after you launch R (in your case via RStudio), you can use:

    library(future)
    plan(cluster)   ## defaults to plan(cluster, workers = availableWorkers())
    

    and then you'll have nbrOfWorkers() parallel workers to play with when you use the future framework for parallelization, e.g.

    library(future.apply)
    y <- future_lapply(X, FUN = slow_fcn)
    

    Warning: R itself has a built-in limit of at most 125 parallel workers, and in practice fewer. See parallelly::availableConnections() for details. So, you need to lower the total number of parallel workers from your currently requested 144, e.g. use --ntasks=14 and --cpus-per-task=8 (= 112 parallel workers).
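    To check how close a given session is to that limit, parallelly provides helpers for inspecting R's connection slots (shown as a sketch; see their help pages for specifics):

    ```r
    parallelly::availableConnections()  ## total connection slots in this R build
    parallelly::freeConnections()       ## slots still free, i.e. an upper bound
                                        ## on how many more workers you can add
    ```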

    Here's a Slurm job script r-multihost.sh that launches an R script illustrating how availableWorkers() works:

    #! /usr/bin/bash -l
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=8
    
    echo "Started on: $(date --rfc-3339=seconds)"
    echo "Hostname: $(hostname)"
    echo "Working directory: $PWD"
    
    ## Run a small test R script using parallel workers
    Rscript r-multihost.R
    
    echo "Finished on: $(date --rfc-3339=seconds)"
    

    Here's the R script r-multihost.R called by the above job script:

    library(future)
    library(future.apply)
    
    message(sprintf("Running R v%s", getRversion()))
    
    ncores <- parallelly::availableCores()
    message(sprintf("Number of CPU cores available on the current machine: %d", ncores))
    
    workers <- parallelly::availableWorkers()
    message(sprintf("Possible set of parallel workers: [n=%d] %s", length(workers), paste(workers, collapse = ", ")))
    
    ## Set up a cluster of parallel workers
    t0 <- Sys.time()
    message(sprintf("Setting up %d parallel workers ...", length(workers)), appendLF = FALSE)
    plan(cluster, workers = workers)
    message(sprintf("done [%.1fs]", difftime(Sys.time(), t0, units = "secs")))
    
    message(sprintf("Number of parallel workers: %d", nbrOfWorkers()))
    
    ## Ask all parallel workers to respond with some info
    info <- future_lapply(seq_len(nbrOfWorkers()), FUN = function(idx) {
      data.frame(idx = idx, hostname = Sys.info()[["nodename"]], pid = Sys.getpid())
    })
    info <- do.call(rbind, info)
    print(info)
    
    print(sessionInfo())
    

    When submitting this as sbatch r-multihost.sh, you'd get something like:

    Started on: 2023-04-03 12:32:31-07:00
    Hostname: c4-n37
    Working directory: /home/alice/r-parallel-example
    Running R v4.2.2
    Number of CPU cores available on the current machine: 8
    Possible set of parallel workers: [n=16] c4-n37, c4-n37, c4-n37, c4-n37, c4-n37, c4-n37, c4-n37, c4-n37, c4-n38, c4-n38, c4-n38, c4-n38, c4-n38, c4-n38, c4-n38, c4-n38
    Setting up 16 parallel workers ...done [50.2s]
    Number of parallel workers: 16
       idx hostname    pid
    1    1   c4-n37  45529
    2    2   c4-n37  45556
    3    3   c4-n37  45583
    4    4   c4-n37  45610
    5    5   c4-n37  45638
    6    6   c4-n37  45665
    7    7   c4-n37  45692
    8    8   c4-n37  45719
    9    9   c4-n38  99981
    10  10   c4-n38 100164
    11  11   c4-n38 100343
    12  12   c4-n38 100521
    13  13   c4-n38 100699
    14  14   c4-n38 100880
    15  15   c4-n38 101058
    16  16   c4-n38 101236
    R version 4.2.2 (2022-10-31)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: CentOS Linux 7 (Core)
    
    Matrix products: default
    BLAS:   /software/R/lib64/R/lib/libRblas.so
    LAPACK: /software/R/lib64/R/lib/libRlapack.so
    
    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
     [9] LC_ADDRESS=C               LC_TELEPHONE=C            
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
    [1] future.apply_1.10.0 future_1.32.0
    
    loaded via a namespace (and not attached):
    [1] compiler_4.2.2    parallelly_1.35.0 parallel_4.2.2    tools_4.2.2      
    [5] listenv_0.9.0     rappdirs_0.3.3    codetools_0.2-19  digest_0.6.31    
    [9] globals_0.16.2   
    Finished on: 2023-04-03 12:33:30-07:00