I am trying to control where my MPI code executes. There are several ways to do so: taskset, dplace, numactl, or simply the mpirun options such as --bind-to or -cpu-set.
The machine is shared memory: 16 nodes of 2 × 12 cores (so 24 cores per node).
> numactl -H
available: 16 nodes (0-15)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 192 193 194 195 196 197 198 199 200 201 202 203
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 204 205 206 207 208 209 210 211 212 213 214 215
node 2 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 216 217 218 219 220 221 222 223 224 225 226 227
... (output truncated)
node 15 cpus: 180 181 182 183 184 185 186 187 188 189 190 191 372 373 374 375 376 377 378 379 380 381 382 383
node distances:
node    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   0:  10  50  65  65  65  65  65  65  65  65  79  79  65  65  79  79
   1:  50  10  65  65  65  65  65  65  65  65  79  79  65  65  79  79
   2:  65  65  10  50  65  65  65  65  79  79  65  65  79  79  65  65
   3:  65  65  50  10  65  65  65  65  79  79  65  65  79  79  65  65
   4:  65  65  65  65  10  50  65  65  65  65  79  79  65  65  79  79
   5:  65  65  65  65  50  10  65  65  65  65  79  79  65  65  79  79
   6:  65  65  65  65  65  65  10  50  79  79  65  65  79  79  65  65
   7:  65  65  65  65  65  65  50  10  79  79  65  65  79  79  65  65
   8:  65  65  79  79  65  65  79  79  10  50  65  65  65  65  65  65
   9:  65  65  79  79  65  65  79  79  50  10  65  65  65  65  65  65
  10:  79  79  65  65  79  79  65  65  65  65  10  50  65  65  65  65
  11:  79  79  65  65  79  79  65  65  65  65  50  10  65  65  65  65
  12:  65  65  79  79  65  65  79  79  65  65  65  65  10  50  65  65
  13:  65  65  79  79  65  65  79  79  65  65  65  65  50  10  65  65
  14:  79  79  65  65  79  79  65  65  65  65  65  65  65  65  10  50
  15:  79  79  65  65  79  79  65  65  65  65  65  65  65  65  50  10
My code does not take advantage of the shared memory; I would like to use the machine as if it were distributed memory. But the processes seem to move around and get too far from their data, so I would like to bind them and see whether the performance improves.
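For instance, this is roughly how I watch them drift (just a sketch; myexec is my executable name and PSR is the logical CPU each process is currently scheduled on):
watch -n 1 'ps -C myexec -o pid,psr,pcpu,comm'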
What I have tried so far:
The classic call: mpirun -np 64 ./myexec param > logfile.log
Now I want to bind the run to the last nodes, let's say 12 to 15, with dplace or numactl (I do not see a major difference between them...):
mpirun -np 64 dplace -c144-191,336-383 ./myexec param > logfile.log
mpirun -np 64 numactl --physcpubind=144-191,336-383 -l ./myexec param > logfile.log
(the main difference between the two is the -l of numactl, which binds the memory allocation to the local node, but I am not even sure it makes a difference...)
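As a sanity check for the -l part, I assume one can look at the per-node memory of a running process with numastat (from the numactl package); <pid> is a placeholder for one of the myexec PIDs:
numastat -p <pid>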
So, they both work well: the processes are bound where I wanted them, BUT looking closer at each process, it appears that some are allocated on the same core! So they are each using only 50% of that core! This happens even though the number of available cores is larger than the number of processes! This is not good at all.
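This is how I "look closer": I list which logical CPU each rank is currently sitting on, sorted by PSR (the grouping into cores is my own interpretation of the IDs, not something the command prints):
ps -C myexec -o pid,psr --sort=psr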
So I tried to add some mpirun options like --nooversubscribe, but it changes nothing... I do not understand why. I also tried --bind-to none (to avoid conflicts between mpirun and dplace/numactl), -cpus-per-proc 1 and -cpus-per-rank 1... none of that solves it.
So I tried with only mpirun options:
mpirun -cpu-set 144-191 -np 64 ./myexec param > logfile.log
but the -cpu-set option is not extensively documented, and I cannot find a way to bind one process per core.
The question: can someone help me get one process per core, on the cores that I want?
Omit 336-383 from the list of physical CPUs in the numactl command. Those are the second hardware threads, and having them on the allowed CPU list permits the OS to schedule two processes on the different hardware threads of the same core.
Generally, with Open MPI, mapping and binding are two separate operations, and to have both done on a per-core basis, the following options are necessary:
--map-by core --bind-to core
The mapper starts by default from the first core on the first socket. To limit the core choice, pass --cpu-set from-to. In your case, the full command should be:
mpirun --cpu-set 144-191 --map-by core --bind-to core -np 64 ./myexec param > logfile.log
You can also pass the --report-bindings option to get a nice graphical visualisation of the bindings (which in your case will be a bit hard to read...).
Note that --nooversubscribe is used to prevent the library from placing more processes than there are slots defined on the node. By default there are as many slots as logical CPUs seen by the OS, therefore passing this option does nothing in your case (64 < 384).
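If you want to double-check that slot count, the number of logical CPUs the OS exposes is simply:
nproc   # or: grep -c ^processor /proc/cpuinfo
which on your machine should report 384 (16 nodes × 24 logical CPUs), hence 64 processes never trigger the oversubscription guard.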