slurm, sbatch

sbatch slurm jobs which release CPU resources individually


I have a cluster of many nodes with many cores, and I simply want to run thousands of jobs on it that each require a single CPU, preferably with sbatch. After going through the documentation for several hours I still run into problems. My current setup is:

#SBATCH --nodes=4
#SBATCH --tasks-per-node=25
#SBATCH --distribution=block

srun ./my_experiment

I start several of these with sbatch and they seem to queue up nicely.

This script starts 100 instances of my_experiment, which is intended. Unfortunately, the job seems to hog the resources of all 100 CPUs even when 99 of the experiments have already ended. How do I alleviate this?

Secondly, the jobs don't seem to share nodes with each other, even though the nodes have more than 40 cores.

Is it even possible to sbatch a bunch of tasks and have them release their resources individually?


Solution

  • Unfortunately they seem to hog the resources of all 100 CPUs even if 99 experiments already ended.

    That is because you create a single job spanning at least 4 nodes, requesting 25 CPUs per node for 25 tasks each. A job releases its allocation only when all of its tasks have ended.

    Assuming there is no communication between your processes, your workflow seems better suited for job arrays. With job arrays, the idea is to create many jobs that are independent but easily manageable in sets.

    #SBATCH --ntasks=1
    #SBATCH --array=1-100
    
    srun ./my_experiment
    

    You will end up with 100 jobs that start and end independently of one another, but which you can still manage (for instance, cancel) with a single command.
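
    For reference, a few commands for managing the whole array at once. The job ID `12345` below is hypothetical; substitute the ID that sbatch prints when you submit:

    ```shell
    squeue -j 12345      # shows one line per pending/running array task
    scancel 12345        # cancels the entire array with a single command
    scancel 12345_7      # cancels only array task 7
    ```

    Slurm also supports a throttle syntax, `--array=1-100%10`, which limits the array to at most 10 concurrently running tasks, which can be useful on a shared cluster.
    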

    If your program my_experiment uses the SLURM_PROCID environment variable, you can replace it with SLURM_ARRAY_TASK_ID.
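
    As a minimal sketch of how the array index can drive each experiment, assuming a hypothetical params.txt with one argument per line (line N feeds array task N; the file name and argument convention are not part of the original question):

    ```shell
    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --array=1-100

    # Pick the line of params.txt that matches this task's array index.
    PARAM=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)

    srun ./my_experiment "$PARAM"
    ```
    
    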

    Secondly, they don't seem to share nodes with each other, even though the nodes have more than 40 cores.

    You explicitly request 25 cores per node for each job, so unless a node has at least 50 cores, Slurm cannot place two such jobs on the same node. If the core count is larger than 50, the blocker might be memory: each job may be requesting enough memory that two of them cannot fit on one node.
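
    If memory is the blocker, making each job's memory request explicit can let Slurm pack several of them onto one node. A sketch, under the assumption that 1 GB per experiment is enough (size it to your program's real footprint):

    ```shell
    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --array=1-100
    # The 1G figure below is an assumption. Without an explicit request,
    # the cluster's default per-job memory allocation may prevent node
    # sharing even when plenty of cores are free.
    #SBATCH --mem-per-cpu=1G

    srun ./my_experiment
    ```
    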