How to tell Julia to utilize more than one node when starting job on cluster?


I am running a computation in Julia on our cluster using the SLURM workload manager.

The jobs are started using a slurm batch script of the form:

#SBATCH --partition=partition
#SBATCH --nodes=2

export JULIA_NUM_THREADS=16

julia --optimize=3 --compile=all --threads=16 --project myScript.jl

In myScript.jl I use the Distributed.jl package to add multiple workers, each using 16 threads (set by the export statement above), to perform some parallel computation:

using Distributed

const Nworkers = 10
addprocs(Nworkers - 1)

@sync for n in 1:Nworkers
    @spawnat :any longComputation()
end

This works well on a single node, but when I request more than one node, only one of them is actually used.

How can I get Julia to use all available resources in such a batch job?


Solution

  • I don't use multithreading, but I use Distributed with Julia across multiple nodes on a SLURM cluster daily. I think the key problem is how you are spawning your workers. Under the hood, SlurmManager launches the processes with srun, which is what lets it place workers on multiple nodes. A plain addprocs(10) knows nothing about the other nodes in the allocation and starts all of its workers on the node where the main script is running. Try adding procs with:

    addprocs(SlurmManager())
    

    This will add as many procs as you have tasks in your SBATCH script. For me that is 256:

    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=64
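
    A quick way to check whether the workers really landed on several nodes is to ask each one for its hostname. The sketch below uses only the Distributed standard library. `workers_per_host` is a hypothetical helper name, and the local `addprocs(2)` is just a stand-in so the snippet runs anywhere. On the cluster it would be replaced by `addprocs(SlurmManager())`, which presumably comes from the SlurmClusterManager.jl package (an assumption, since the package is not named here).

    ```julia
    using Distributed

    # Hypothetical helper: count how many workers run on each host.
    function workers_per_host()
        counts = Dict{String,Int}()
        for w in workers()
            host = remotecall_fetch(gethostname, w)  # hostname as seen by worker w
            counts[host] = get(counts, host, 0) + 1
        end
        return counts
    end

    addprocs(2)  # local stand-in; on the cluster: addprocs(SlurmManager())

    for (host, n) in workers_per_host()
        println(host, " => ", n, " worker(s)")
    end
    ```

    With plain addprocs(n), every worker should report the same hostname. With workers spread over the allocation, one would expect one entry per node.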
    

    Like you, I also want to drop one proc so that the node hosting the main process stays underutilized. You can simply call rmprocs(2) (to remove just worker number 2) after you have initialized all of the workers. Note, however, that your loop still runs over 1:Nworkers rather than 1:(Nworkers-1), so it spawns one more task than there are workers.
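
    One way to avoid the off-by-one is to never hard-code the worker count and instead start one task per entry of `workers()`. A minimal sketch, assuming a placeholder body for `longComputation` (the real one is not shown in the question) and local workers as a stand-in for `addprocs(SlurmManager())`:

    ```julia
    using Distributed

    addprocs(4)   # local stand-in; on the cluster: addprocs(SlurmManager())
    rmprocs(2)    # drop worker 2, as suggested above

    # Placeholder for the real computation; it must be defined on every process.
    @everywhere longComputation() = sum(abs2, 1:1_000)

    # One task per remaining worker, pinned to that worker.
    futures = [@spawnat w longComputation() for w in workers()]
    results = fetch.(futures)   # blocks until every task has finished

    println("finished ", length(results), " tasks on workers ", workers())
    ```

    Because the comprehension iterates over `workers()`, the number of tasks follows the number of live workers even after `rmprocs`.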

    To use multithreading in the workers created via addprocs(SlurmManager()), I suspect you will need to make sure that --threads=16 is passed to the workers. You may be able to do that directly through the exeflags keyword of addprocs (see docs).
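
    A sketch of how the thread count could be forwarded through `exeflags`. `add_threaded_workers` is a hypothetical helper, and the local call with 2 workers and 2 threads stands in for `SlurmManager()` with 16 threads. It assumes the cluster manager passes `exeflags` on to the worker command line. On SLURM, each task would presumably also need enough cores reserved, for example through `#SBATCH --cpus-per-task=16`.

    ```julia
    using Distributed

    # Hypothetical helper: start workers with an explicit number of threads.
    # `manager` is whatever addprocs accepts as its first argument:
    # an integer for local workers, or SlurmManager() on the cluster.
    function add_threaded_workers(manager; threads::Integer)
        flags = `--threads=$threads`   # forwarded to each worker's julia command
        return addprocs(manager; exeflags=flags)
    end

    # Local stand-in; on the cluster: add_threaded_workers(SlurmManager(); threads=16)
    pids = add_threaded_workers(2; threads=2)

    # Confirm that every worker really started with the requested thread count.
    for w in pids
        println("worker ", w, ": ", remotecall_fetch(Threads.nthreads, w), " threads")
    end
    ```

    Checking `Threads.nthreads()` on each worker is a cheap sanity test before launching a long job, since a worker that silently starts with one thread would not raise any error.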