multithreading performance multiprocessing mpi multicore

mpirun on multicore architecture: --bind-to l3 or --bind-to core

I am running a code on an architecture with 24 cores per socket and would like to use one MPI rank for each group of three cores that share an L3 cache block. That gives 8 MPI ranks per socket and 16 per node, with 3 threads per rank. I think the following command line should apply:

mpirun --bind-to l3 -np 16 gmx_mpi mdrun -nt 3

Here --bind-to binds each MPI rank to one L3 cache block, -np 16 starts 16 MPI ranks on the node, and -nt 3 requests 3 threads per MPI rank. Is this the correct approach?
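For reference, the rank and thread counts above follow from simple arithmetic on the topology. The sketch below only reproduces that arithmetic; the variable names are mine, and the topology numbers are the ones quoted above rather than values detected from the machine (a tool such as hwloc's lstopo can be used to check the real layout).

```shell
#!/bin/sh
# Topology numbers as quoted in the question (assumed, not detected).
SOCKETS_PER_NODE=2
CORES_PER_SOCKET=24
CORES_PER_L3=3

# One MPI rank per L3 block, one thread per core in that block.
RANKS_PER_SOCKET=$((CORES_PER_SOCKET / CORES_PER_L3))
RANKS_PER_NODE=$((RANKS_PER_SOCKET * SOCKETS_PER_NODE))
THREADS_PER_RANK=$CORES_PER_L3

echo "ranks per socket: $RANKS_PER_SOCKET"
echo "ranks per node:   $RANKS_PER_NODE"
echo "threads per rank: $THREADS_PER_RANK"
```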

If each core supports multithreading (2 hardware threads per core), is it right to write:

mpirun --bind-to l3 -np 16 gmx_mpi mdrun -nt 6

I assume --bind-to core binds one MPI rank to each core, and that each rank then runs either a single thread or 2 threads per core to exploit multithreading, e.g.

mpirun --bind-to core -np 48 gmx_mpi mdrun -nt 2

This gives 48 ranks, one per core, on a 2-socket platform, with 2 threads per core (MT).

Can you confirm?

Solution

The correct option appears to be --bind-to l3cache rather than --bind-to l3:

mpirun --bind-to l3cache -np 16 gmx_mpi mdrun -nt 6
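One way to check that the ranks really land on the intended L3 blocks is to add Open MPI's --report-bindings option, which prints the binding of every rank at startup. The sketch below assumes Open MPI and only assembles and prints the command line instead of launching it; the variable names are hypothetical. It keeps -nt as posted, although the GROMACS documentation describes -ntomp as the number of OpenMP threads per MPI rank, so that option may be the more appropriate one with an MPI build.

```shell
#!/bin/sh
# Layout assumed from the question: 3 cores per L3 block, 2 hardware
# threads per core, 16 L3 blocks (and therefore 16 ranks) per node.
CORES_PER_L3=3
SMT_PER_CORE=2
RANKS_PER_NODE=16
THREADS_PER_RANK=$((CORES_PER_L3 * SMT_PER_CORE))

# Assemble the command without running it; --report-bindings asks
# Open MPI to print where each rank was bound.
CMD="mpirun --bind-to l3cache --report-bindings -np $RANKS_PER_NODE gmx_mpi mdrun -nt $THREADS_PER_RANK"
echo "$CMD"
```

If the reported bindings show several ranks sharing one L3 block, the mapping policy (Open MPI's --map-by option) is the next thing to look at.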