Tags: performance, hpc, slurm, scientific-computing

Optimize performance on a SLURM cluster


I am writing to you after many attempts on a CPU cluster structured as follows:

144 standard compute nodes, each with:

  • 2× AMD EPYC 7742 (2× 64 cores, 2.25 GHz)
  • 256 GB (16× 16 GB) DDR4, 3200 MHz
  • InfiniBand HDR100 (ConnectX-6)
  • local disk for the operating system (1× 240 GB SSD)
  • 1 TB NVMe

Now, since my core-hours here are limited, I want to maximize performance as much as I can. I am doing some benchmarking with the following submission script:

#!/bin/bash -x
#SBATCH --account=XXXX
#SBATCH --ntasks=256
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=24:00:00
#SBATCH --partition=batch

srun ./myprogram

The program I am running is GROMACS 2020 (MPI), a software package for molecular dynamics simulations.

In the machine's manual I read about these options:

--ntasks
--ntasks-per-node
--cpu-per-node
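
For reference, this is how I understand task placement to map onto the node layout above (128 physical cores per node); please correct me if I am wrong:

#SBATCH --nodes=2              # number of nodes requested
#SBATCH --ntasks=256           # total number of MPI ranks
#SBATCH --ntasks-per-node=128  # ranks placed on each node (one per physical core)
#SBATCH --cpus-per-task=1      # cores assigned to each rank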

However, considering how recent the hardware is, I am getting mediocre performance. Indeed, on a cluster five years older I get better performance with comparable resources.

So, do you see a good combination of those options that maximizes performance and avoids wasting core-hours? My system size is ~100K atoms (if that helps).

Any feedback would be very much appreciated,

Looking forward to hearing your opinions.

Best Regards

VG


Solution

  • In your case, the 256 tasks have no constraint to run in the same rack or even close to each other, so Slurm has no clue how to place the job efficiently on your cluster. It could schedule 1 task on each of 256 different nodes, which is not efficient at all.

    To be sure that everything is placed correctly, you should force the tasks to be packed onto specific nodes.

    #!/bin/bash -x
    #SBATCH --account=XXXX
    #SBATCH --nodes=2
    #SBATCH --ntasks=256
    #SBATCH --ntasks-per-core=1
    #SBATCH --ntasks-per-node=128
    #SBATCH --output=mpi-out.%j
    #SBATCH --error=mpi-err.%j
    #SBATCH --time=24:00:00
    #SBATCH --partition=batch
    
    srun ./myprogram
    

    That way, the 256 tasks will normally be scheduled one per core across the AMD sockets of the 2 nodes. This avoids oversubscription and CPU-cycle sharing, which is inefficient. To make sure the benchmark is not disturbed by other jobs, also ask for --exclusive.
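
    Putting this together with the GROMACS side, a refined submission script could look like the sketch below. It is only a sketch: the MPI-enabled GROMACS binary is assumed to be called gmx_mpi and the input files to be named benchmark.* (both are assumptions, adjust them to your installation).

    #!/bin/bash -x
    #SBATCH --account=XXXX
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=128   # one MPI rank per physical core
    #SBATCH --cpus-per-task=1
    #SBATCH --exclusive             # keep other jobs off these nodes while benchmarking
    #SBATCH --output=mpi-out.%j
    #SBATCH --error=mpi-err.%j
    #SBATCH --time=24:00:00
    #SBATCH --partition=batch
    
    # One OpenMP thread per MPI rank, matching --cpus-per-task
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
    
    # --cpu-bind=cores pins each rank to its own core; -ntomp tells mdrun how
    # many OpenMP threads to start per rank (binary and file names assumed)
    srun --cpu-bind=cores gmx_mpi mdrun -ntomp ${OMP_NUM_THREADS} -deffnm benchmark

    With ~100K atoms, 256 ranks may already be near the scaling limit for this system size, so it is also worth comparing the per-core performance of a 1-node and a 2-node run before settling on a layout.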