Tags: mpi, slurm, hpc

How to make a SLURM job step use the minimum number of nodes?


I am trying to run many smaller SLURM job steps within one big multi-node allocation, but am struggling with how the tasks of the job steps are assigned to the different nodes. In general I would like to keep the tasks of one job step as local as possible (same node, same socket) and only spill over to the next node when not all tasks can be placed on a single node.

The following example shows a case where I allocate 2 nodes with 4 tasks each and launch a job step asking for 4 tasks:

$ salloc -N 2 --ntasks-per-node=4 srun -l -n 4 hostname
salloc: Granted job allocation 9677936
0: compute-3-6.local
1: compute-3-6.local
2: compute-3-7.local
3: compute-3-7.local
salloc: Relinquishing job allocation 9677936

I would like these 4 tasks to go to one of the nodes, so that a second job step could claim the other node, but that is not what happens: the first job step gets distributed evenly across the two nodes. If I launched a second job step with 4 tasks, it would be spread across both nodes as well, causing lots of unnecessary inter-node network communication that could easily be avoided.

I have already found out that I can force my job step to run on a single node by explicitly including -N 1 for the job step launch:

$ salloc -N 2 --ntasks-per-node=4 srun -l -N 1 -n 4 hostname
salloc: Granted job allocation 9677939
0: compute-3-6.local
1: compute-3-6.local
2: compute-3-6.local
3: compute-3-6.local
salloc: Relinquishing job allocation 9677939

However, the number of job steps launched and the number of tasks per job step depend on user input in my case, so I cannot just force -N 1 for all of them. There may be job steps with so many tasks that they cannot be placed on a single node.

Reading the srun manpage, I first thought that the --distribution=block:block option should work for me, but it does not. It seems that this option only comes into play after the decision on the number of nodes to be used by a job step has been made.
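
For reference, the attempt looked roughly like this (the tasks were still split across both nodes, just as in the first example):

$ salloc -N 2 --ntasks-per-node=4 srun -l -m block:block -n 4 hostname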

Another idea I had was that the job step might simply be inheriting the -N 2 argument from the allocation and was therefore forced to use two nodes as well. I tried setting -N 1-2 for the job step, in order to at least allow SLURM to assign the job step to a single node, but this has no effect for me, not even when combined with the --use-min-nodes flag.

$ salloc -N 2 --ntasks-per-node=4 srun -l -N 1-2 --use-min-nodes -n 4 hostname
salloc: Granted job allocation 9677947
0: compute-3-6.local
1: compute-3-6.local
2: compute-3-7.local
3: compute-3-7.local
salloc: Relinquishing job allocation 9677947

How do I make a SLURM job step use the minimum number of nodes?


Solution

  • Unfortunately, there is no other way. You have to use -N.

    Even if you use -n 1 (instead of 4), there will be a warning:

    salloc -N 2 --ntasks-per-node=4 srun -l -n 1 hostname
    srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
    

    But if you use,

    salloc -N 2 --ntasks-per-node=4 srun -l -N 1-2 --use-min-nodes -n 1 hostname
    

    there won't be any warning here, because then a minimum of one node will be used.

    Reason: SLURM tries to launch at least one task on each of the allocated/requested nodes, unless told otherwise with the -N flag (see the output below).

    srun -l -N 1-2 --use-min-nodes -m plane=48 -n 4 hostname
    0: compute-3-6.local
    1: compute-3-6.local
    2: compute-3-6.local
    3: compute-3-7.local
    

    You can see that one task is launched on the second node while the remaining tasks run on the first node. This happens because your allocation requested two nodes (salloc). If you want to run on a single node, you have to say so explicitly with the -N flag to force SLURM to use only one node.

    I guess you could calculate -N on the fly to address your issue. Since you know the maximum number of tasks that fit on a node (assuming it is a homogeneous system), you can calculate the number of nodes a job step needs before launching its tasks with srun, as in the sketch below.
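
    A minimal sketch of that idea in bash, assuming a homogeneous cluster, a made-up TASKS_PER_NODE capacity, and hostname standing in for the real application:

    #!/bin/bash
    # Sketch: pick the smallest node count for a job step on a homogeneous cluster.
    TASKS_PER_NODE=4    # assumption: how many tasks fit on one of your nodes
    NTASKS=$1           # tasks requested for this job step (user input)

    # Ceiling division: nodes = ceil(NTASKS / TASKS_PER_NODE)
    NODES=$(( (NTASKS + TASKS_PER_NODE - 1) / TASKS_PER_NODE ))

    # Launch the step on exactly as many nodes as it needs.
    srun -l -N "$NODES" -n "$NTASKS" hostname

    Called with 4 tasks this forces a single node; called with, say, 6 tasks it allows two, so steps that cannot fit on one node still run.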
