
SLURM: how to limit CPU job count to avoid wasting GPU resources?


We use SLURM to share CPU and GPU resources across our nodes. Sometimes GPU jobs cannot run because CPU-only jobs have filled all the CPU cores on the nodes; in that case the GPUs sit idle and are wasted.

How can I define a policy that avoids this conflict?

For example, is it possible to cap the number of CPU cores that CPU-only jobs may use on each node, so that cores remain available for GPU jobs?

(Example node: 48 CPU cores and 4 GPU cards --> limit CPU-only jobs to 44 cores so that 4 cores remain reserved for GPU jobs.)


Solution

  • A configuration that is sometimes used for this is to define two overlapping partitions: one containing all the nodes (the CPU partition) and one containing only the GPU nodes (the GPU partition).

    You then set MaxCPUsPerNode to 44 for the CPU partition and to 4 for the GPU partition (see the slurm.conf sketch below).

    Then GPU jobs must be submitted to the GPU partition and CPU-only jobs to the CPU partition (which can be the default). That can be enforced either with resource limits or with a job_submit plugin.
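
    As an illustration, here is a minimal slurm.conf sketch of that layout. The node names, the node count, and the Gres=gpu:4 line are assumptions for a hypothetical cluster of 48-core, 4-GPU nodes; adapt them to your own configuration.

        # slurm.conf excerpt (sketch) -- node names are hypothetical
        GresTypes=gpu
        NodeName=node[01-04] CPUs=48 Gres=gpu:4 State=UNKNOWN

        # CPU partition (default): CPU-only jobs may use at most 44 cores
        # per node, leaving 4 cores free for GPU jobs
        PartitionName=cpu Nodes=node[01-04] MaxCPUsPerNode=44 Default=YES

        # GPU partition: GPU jobs may use at most 4 cores per node (one per GPU)
        PartitionName=gpu Nodes=node[01-04] MaxCPUsPerNode=4

    With that in place, users would submit jobs along these lines (the partition and script names are again only illustrative):

        sbatch --partition=cpu cpu_job.sh
        sbatch --partition=gpu --gres=gpu:1 gpu_job.sh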