I've got a new node that I'm trying to add to my Slurm cluster. The cores on the new machine do not all have the same number of threads: 6 cores have 2 threads each and 4 cores have 1 thread each, a total of 16 CPUs. This is shown by lscpu -e:
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 6300.0000 800.0000
1 0 0 0 0:0:0:0 yes 6300.0000 800.0000
2 0 0 1 1:1:1:0 yes 6300.0000 800.0000
3 0 0 1 1:1:1:0 yes 6300.0000 800.0000
4 0 0 2 2:2:2:0 yes 6300.0000 800.0000
5 0 0 2 2:2:2:0 yes 6300.0000 800.0000
6 0 0 3 3:3:3:0 yes 6300.0000 800.0000
7 0 0 3 3:3:3:0 yes 6300.0000 800.0000
8 0 0 4 4:4:4:0 yes 6300.0000 800.0000
9 0 0 4 4:4:4:0 yes 6300.0000 800.0000
10 0 0 5 5:5:5:0 yes 6300.0000 800.0000
11 0 0 5 5:5:5:0 yes 6300.0000 800.0000
12 0 0 6 6:6:6:0 yes 3600.0000 800.0000
13 0 0 7 7:7:6:0 yes 3600.0000 800.0000
14 0 0 8 8:8:6:0 yes 3600.0000 800.0000
15 0 0 9 9:9:6:0 yes 3600.0000 800.0000
When appending to slurm.conf I'll usually just copy over info from lscpu. For my new machine, the info is:
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 10
Socket(s): 1
I appended the following to slurm.conf:
NodeName=MYNODE CPUs=16 SocketsPerBoard=1 CoresPerSocket=10 ThreadsPerCore=1
However, this raised the following error:
error: NodeNames=MYNODE CPUs=16 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
Internally, it seems like Slurm expects every core on a node to have the same number of threads. How can I correctly configure slurm.conf for my new node?
Try removing SocketsPerBoard=1 CoresPerSocket=10 ThreadsPerCore=1 and just specifying NodeName=MYNODE CPUs=16. If you specify both CPUs and Sockets, CoresPerSocket, etc., Slurm will try to make sense of the CPUs value against them. If you do not specify them, Slurm will accept the CPUs value you give it.