How to deterministically bind Open MPI ranks to NUMA node cores


I am trying to bind one Open MPI rank to every core on a NUMA machine. The machine has 2 NUMA nodes, each with 12 cores. The nodes don't have names, so I can't use --host a:12,b:12. I also want to bind extra ranks deterministically onto specific cores; how can I do that when oversubscribing?

First question: how do I bind 24 ranks onto the NUMA node cores, with one rank per core? The command mpirun -n 24 --bind-to numa --report-bindings ./app fails with "There are not enough slots available in the system", even though lscpu reports CPU(s): 24 and NUMA node(s): 2.

Second question: if I want to run 27 ranks (oversubscribed) on the same NUMA machine, how do I bind them deterministically, so that the 3 extra ranks always land on the same 3 cores each time I run the application?


Solution

  • How to bind 24 ranks onto NUMA node cores? With one rank per core

    First, as @Gilles pointed out: Make sure that your 24 cores are actual physical cores and not hyperthreads.
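One way to check this is to read the "Thread(s) per core" field that lscpu prints: a value of 1 suggests the 24 CPUs are physical cores, while 2 or more indicates hyperthreads. The sketch below uses a hypothetical helper, threads_per_core, that extracts the field from lscpu-style text. The sample input is illustrative only: the CPU and NUMA counts come from the question, and the thread count of 1 is an assumption.

```shell
# Hypothetical helper: read lscpu-style output on stdin and print
# the value of the "Thread(s) per core" field.
threads_per_core() {
    awk -F: '/^Thread\(s\) per core/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Illustrative sample input (the thread count here is an assumption).
# On a real machine you would run: lscpu | threads_per_core
printf 'CPU(s):              24\nThread(s) per core:  1\nNUMA node(s):        2\n' | threads_per_core
```

If the helper prints 2 on your machine, only half of the listed CPUs are physical cores, and the core list passed to mpirun should be adjusted accordingly.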

    The most reliable (and also most tedious) approach is to pass the ordered processor list to the mpirun command. In your case that would be:

    mpirun -np 24 --report-bindings --cpu-list 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 --bind-to cpu-list:ordered ./app
    

    This gives you full control over which process is spawned where. That is really helpful when certain processes communicate a lot, because you can place them close to each other from a hardware point of view. It also lets you avoid cores that other jobs are already running on.
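Typing the core list by hand gets error-prone as the core count grows. As a minimal sketch, the list can be generated with a small POSIX shell function. The name make_cpu_list is hypothetical, and the sketch assumes the cores you want are numbered contiguously. It only prints the mpirun command instead of running it, so you can inspect it first.

```shell
# Hypothetical helper: print a comma-separated list of core IDs
# from $1 to $2 inclusive, e.g. "0,1,2,3".
make_cpu_list() {
    i=$1
    last=$2
    list=""
    while [ "$i" -le "$last" ]; do
        # Prepend a comma only when the list is already non-empty.
        list="${list:+$list,}$i"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Build the list for cores 0..23 and show the resulting command.
cpu_list=$(make_cpu_list 0 23)
echo "mpirun -np 24 --report-bindings --cpu-list $cpu_list --bind-to cpu-list:ordered ./app"
```

To skip cores that are in use, you could concatenate several ranges, for example "$(make_cpu_list 0 5),$(make_cpu_list 12 17)".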

  • If I want to bind 27 ranks (oversubscribe) on the same NUMA machine, how to do it deterministically?

    First, oversubscribing is something that should generally be avoided in production code, especially when every process does the same work. It hurts efficiency, because the program is only as fast as its slowest process, and processes that share a core run slower than the rest.

    That said, you should be able to use a rankfile for that, further described here and here

    For example, let's say you want to pin 4 processes to core 0 of socket 0:

    rankfile:

    rank 0=localhost slot=0:0
    rank 1=localhost slot=0:0
    rank 2=localhost slot=0:0
    rank 3=localhost slot=0:0
    

    And then

    mpirun -np 4 --report-bindings -rf rankfile ./app
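The same idea can be extended to the 27-rank case from the question. The sketch below generates a rankfile that puts ranks 0 to 23 on one core each and then places the extra ranks on an explicit list of cores, so they land on the same cores in every run. It rests on assumptions that are not confirmed by the question: each NUMA node corresponds to one socket, cores are numbered 0 to 11 within each socket, and the function name make_rankfile is hypothetical.

```shell
# Hypothetical helper: print a rankfile to stdout.
#   $1    = cores per socket (assumed: 12)
#   $2    = total number of cores (assumed: 24)
#   $3... = global core IDs that receive one extra rank each
make_rankfile() {
    cores_per_socket=$1
    total_cores=$2
    shift 2
    rank=0
    # One rank per core: rank r goes to socket r/12, core r%12.
    while [ "$rank" -lt "$total_cores" ]; do
        printf 'rank %d=localhost slot=%d:%d\n' \
            "$rank" $((rank / cores_per_socket)) $((rank % cores_per_socket))
        rank=$((rank + 1))
    done
    # Extra ranks: each goes to the explicitly listed core.
    for core in "$@"; do
        printf 'rank %d=localhost slot=%d:%d\n' \
            "$rank" $((core / cores_per_socket)) $((core % cores_per_socket))
        rank=$((rank + 1))
    done
}

# 24 ranks on 24 cores, plus ranks 24, 25, 26 on cores 0, 1, 2.
# Redirect the output to a file to use it: make_rankfile 12 24 0 1 2 > rankfile
make_rankfile 12 24 0 1 2
```

You would then start the job with something like mpirun -np 27 --report-bindings -rf rankfile ./app. Depending on the Open MPI version, you may also need to add --oversubscribe, and --report-bindings lets you confirm that the extra ranks ended up where the rankfile says.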