
Slurm cluster: configure a node where not all cores have an equal number of threads


I've got a new node that I'm trying to add to my Slurm cluster. The cores on the new machine do not all have the same number of threads: 6 cores have 2 threads each and 4 cores have 1 thread each, for a total of 16 logical CPUs. This is shown by lscpu -e:

CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ
  0    0      0    0 0:0:0:0          yes 6300.0000 800.0000
  1    0      0    0 0:0:0:0          yes 6300.0000 800.0000
  2    0      0    1 1:1:1:0          yes 6300.0000 800.0000
  3    0      0    1 1:1:1:0          yes 6300.0000 800.0000
  4    0      0    2 2:2:2:0          yes 6300.0000 800.0000
  5    0      0    2 2:2:2:0          yes 6300.0000 800.0000
  6    0      0    3 3:3:3:0          yes 6300.0000 800.0000
  7    0      0    3 3:3:3:0          yes 6300.0000 800.0000
  8    0      0    4 4:4:4:0          yes 6300.0000 800.0000
  9    0      0    4 4:4:4:0          yes 6300.0000 800.0000
 10    0      0    5 5:5:5:0          yes 6300.0000 800.0000
 11    0      0    5 5:5:5:0          yes 6300.0000 800.0000
 12    0      0    6 6:6:6:0          yes 3600.0000 800.0000
 13    0      0    7 7:7:6:0          yes 3600.0000 800.0000
 14    0      0    8 8:8:6:0          yes 3600.0000 800.0000
 15    0      0    9 9:9:6:0          yes 3600.0000 800.0000
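
The per-core thread counts can be tallied from this table with a short script. This is only a minimal sketch: the sample text is a trimmed copy of the CPU and CORE columns above, and the function name threads_per_core is illustrative rather than part of any tool.

```python
from collections import Counter

# CPU and CORE columns copied from the `lscpu -e` output above.
LSCPU_ROWS = """\
CPU CORE
0 0
1 0
2 1
3 1
4 2
5 2
6 3
7 3
8 4
9 4
10 5
11 5
12 6
13 7
14 8
15 9
"""


def threads_per_core(text):
    """Map each core ID to the number of logical CPUs (threads) listed for it."""
    lines = text.strip().splitlines()
    # Locate the CORE column by name so extra columns do not matter.
    core_col = lines[0].split().index("CORE")
    return Counter(int(line.split()[core_col]) for line in lines[1:])


per_core = threads_per_core(LSCPU_ROWS)
# Histogram: key = threads on a core, value = number of cores with that count.
histogram = Counter(per_core.values())

for threads, cores in sorted(histogram.items()):
    print(f"{cores} core(s) with {threads} thread(s)")
print("logical CPUs:", sum(per_core.values()))
```

Because the column is found by its header name, the same function should also accept the full table text, as long as every column value is a single whitespace-free token.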

When appending to slurm.conf I'll usually just copy over info from lscpu. For my new machine, the info is:

CPU(s):                          16
On-line CPU(s) list:             0-15
Thread(s) per core:              1
Core(s) per socket:              10
Socket(s):                       1

I appended to slurm.conf the following: NodeName=MYNODE CPUs=16 SocketsPerBoard=1 CoresPerSocket=10 ThreadsPerCore=1. However, this raised the following error:

error: NodeNames=MYNODE CPUs=16 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.

Internally, it seems like Slurm expects every core on a node to have the same number of threads. How can I correctly configure slurm.conf for my new node?


Solution

  • Try removing SocketsPerBoard=1 CoresPerSocket=10 ThreadsPerCore=1 and specifying only NodeName=MYNODE CPUs=16. If you specify CPUs together with Sockets, CoresPerSocket, and so on, Slurm tries to reconcile the CPUs value with those counts, and it reports the error from the question when none of their products equals CPUs. If you leave the topology parameters out, Slurm accepts the CPUs value you give it.
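
The error message names three totals that CPUs is compared against. The sketch below reproduces that comparison for the values in the question. It is based only on the wording of the error message, not on Slurm's source code, and the function name cpu_count_candidates is hypothetical.

```python
def cpu_count_candidates(sockets, cores_per_socket, threads_per_core):
    """Return the three CPU totals named in the error message."""
    return {
        "Sockets": sockets,
        "Sockets*CoresPerSocket": sockets * cores_per_socket,
        "Sockets*CoresPerSocket*ThreadsPerCore":
            sockets * cores_per_socket * threads_per_core,
    }


# Values from the slurm.conf line in the question.
cpus = 16
candidates = cpu_count_candidates(sockets=1, cores_per_socket=10, threads_per_core=1)

# Names of the totals that equal CPUs; empty when nothing matches.
matches = [name for name, total in candidates.items() if total == cpus]

for name, total in candidates.items():
    print(f"{name} = {total}")
print("CPUs=16 matches:", matches or "nothing")
```

None of the three totals equals 16. Setting ThreadsPerCore=2 would not help either, since 1 * 10 * 2 = 20, so no uniform socket/core/thread description adds up to this node's 16 CPUs.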