
SLURM: How to run 30 jobs on particular nodes only?


You need to run, say, 30 srun jobs, and each job must run on a node from a particular list of nodes (nodes with the same performance, so that timings can be compared fairly). How would you do it?

What I tried:

  • srun --nodelist=machineN[0-3] <some_cmd>: runs <some_cmd> on all of the listed nodes simultaneously (what I need is to run <some_cmd> on one available node from the list)

  • srun -p partition seems to work, but needs a partition that contains exactly machineN[0-3], which is not always the case.

Ideas?


Solution

  • Update: Slurm version 23.02 has fixed this, as stated in the release notes: "Allow for --nodelist to contain more nodes than required by --nodes."
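
A minimal sketch of how that could be used, assuming `srun --version` prints a string of the form `slurm 23.02.1`. The helper name `nodelist_subset_supported` is hypothetical, not part of Slurm, and the exact node selection behaviour on 23.02 and later should be checked on your own cluster.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given version string is 23.02 or newer.
# Assumes the "slurm MAJOR.MINOR.PATCH" format.
nodelist_subset_supported() {
    ver=${1#slurm }        # drop the leading "slurm "
    major=${ver%%.*}       # e.g. 23
    rest=${ver#*.}
    minor=${rest%%.*}      # e.g. 02
    minor=${minor#0}       # strip one leading zero so "02" becomes "2"
    [ "$major" -gt 23 ] || { [ "$major" -eq 23 ] && [ "${minor:-0}" -ge 2 ]; }
}

# Usage (not executed here; requires a Slurm installation):
#   if nodelist_subset_supported "$(srun --version)"; then
#       srun --nodes=1 --nodelist=machineN[0-3] <some_cmd>
#   fi
```

With `--nodes=1`, the expectation is that Slurm picks one node out of the list rather than all of them.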


    You can go the opposite direction and use the --exclude option, which both srun and sbatch accept:

    srun --exclude=machineN[4-XX] <some_cmd>
    

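    Then Slurm will only consider nodes that are not in the excluded list. If the list is long and complicated, it can be saved in a file and the file name passed to --exclude (the value is treated as a file name if it contains a "/" character).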
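
To launch all 30 runs this way, a loop along the following lines might work. This is a sketch, not a tested recipe: `run_excluding` and the `DRY_RUN` variable are hypothetical names, and `--nodes=1 --ntasks=1` is an assumption that each run should occupy a single node with a single task.

```shell
#!/bin/sh
# Hypothetical helper: start COUNT srun invocations, each kept off the
# nodes in EXCLUDE. Set DRY_RUN=1 to print the commands instead.
run_excluding() {
    count=$1      # number of srun invocations, e.g. 30
    exclude=$2    # node list to avoid, e.g. 'machineN[4-XX]'
    shift 2       # the remaining arguments are the command to run
    i=1
    while [ "$i" -le "$count" ]; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            # Print the command that would be run.
            echo "srun --nodes=1 --ntasks=1 --exclude=$exclude $*"
        else
            # Run in the background so Slurm can schedule the runs itself.
            srun --nodes=1 --ntasks=1 --exclude="$exclude" "$@" &
        fi
        i=$((i + 1))
    done
    wait   # block until every background srun has finished
}

# Usage (not executed here; requires a Slurm installation):
#   run_excluding 30 'machineN[4-XX]' <some_cmd>
```

If concurrent runs could disturb the timings, dropping the `&` makes the runs execute one after another.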

    Another option is to check whether the Slurm configuration includes "features" with

    sinfo  --format "%20N %20f"
    

    If the "features" column shows a comma-delimited list of the features each node has (for example CPU family or network connection type), you can select the subset of nodes that have a specific feature using

    srun --constraint=<some_feature> <some_cmd>
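
Before relying on a constraint, it can help to check which nodes actually carry the feature. The filter below is a small sketch: `nodes_with_feature` is a hypothetical helper that reads lines of the form `NODELIST FEATURES` on standard input, which is what `sinfo --noheader --format "%N %f"` should produce.

```shell
#!/bin/sh
# Hypothetical helper: print the node lists whose comma-separated feature
# column (second field) contains exactly the requested feature.
nodes_with_feature() {
    awk -v want="$1" '{
        n = split($2, feats, ",")
        for (i = 1; i <= n; i++)
            if (feats[i] == want) { print $1; break }
    }'
}

# Usage (not executed here; requires a Slurm installation):
#   sinfo --noheader --format "%N %f" | nodes_with_feature <some_feature>
```

If the printed nodes match the set you want to compare, `--constraint=<some_feature>` restricts the jobs to exactly those nodes.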