c++, c, multiprocessing, mpi, openmpi

What is the difference between a host and a slot in a multicore, multiprocessor environment?


I know that the resource manager (RM) conveys slot information to Open MPI, but:

  1. How does the RM determine the number of slots on a multi-core processor (is 1 core always 1 slot?), and
  2. If I run a.out on a 4-core processor, what is the difference between:

    • myshell$ mpirun --host n1,n1,n1,n1,n1 ./a.out
    • myshell$ mpirun -np 5 --host n1 ./a.out

In other words, in which case am I "oversubscribing" the node?


Solution

  • When it comes to resource managers, e.g. SLURM, LSF, SGE/OGE, Torque, etc., the mapping between slots and cores is left entirely to the system administrator. It usually depends on the nature of the jobs that are to be executed on the nodes. In HPC, where most tasks are CPU-bound, the usual mapping is one slot per core (or per hardware thread). In data processing, where most tasks are I/O-bound, having more slots than cores can be more beneficial.

    The same applies to launching MPI processes. When the hosts are described in the hostfile, the number of slots per host does not necessarily have to match the hardware configuration. Again, it depends on the nature of the MPI job. Slot information is usually used to control how ranks are to be distributed. For example, the default policy in Open MPI is to fill the provided slots on the first host and then move to the next one. Once all hosts are filled, if more ranks remain to be launched, the process starts again from the first node in the host list.
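
    The fill-then-wrap policy described above can be modeled in a few lines. This is a minimal sketch in Python, not Open MPI's actual mapper: the function name `map_by_slot` and the `(hostname, slots)` input format are made up for illustration, and it assumes that each wrap-around pass fills the slots in the same order as the first pass.

```python
def map_by_slot(hosts, num_ranks):
    """Model of a by-slot placement (hypothetical helper, not an Open MPI API).

    hosts: list of (hostname, slots) pairs in host-list order.
    Returns a list whose entry i is the hostname that rank i lands on.
    """
    if num_ranks > 0 and not hosts:
        raise ValueError("no hosts to place ranks on")
    if any(slots < 1 for _, slots in hosts):
        raise ValueError("every host needs at least one slot")

    placement = []
    # Assumption: after all hosts are full, start over from the first host.
    while len(placement) < num_ranks:
        for name, slots in hosts:
            for _ in range(slots):
                if len(placement) == num_ranks:
                    return placement
                placement.append(name)
    return placement


if __name__ == "__main__":
    # Two hosts with 2 slots each, 5 ranks requested.
    print(map_by_slot([("n1", 2), ("n2", 2)], 5))
    # ['n1', 'n1', 'n2', 'n2', 'n1']
```

    With two hosts of 2 slots each, ranks 0-3 fill both hosts and rank 4 wraps around to the first host.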

    The end effect of --host n1,n1,n1,n1,n1 and --host n1 -np 5 is the same: 5 ranks are launched on host n1. The difference is how Open MPI interprets it.

    • mpiexec --host n1,n1,n1,n1,n1 ./a.out tells Open MPI that there are 5 slots on host n1. Since the -np parameter is omitted, mpiexec starts one rank per defined slot, therefore 5 ranks are started at host n1.
    • mpiexec --host n1 -np 5 ./a.out tells Open MPI that there is a single slot on host n1. One rank is launched on n1. Since no more slots are left, mpiexec starts again from the first defined slot, i.e. launches another rank on host n1. This is repeated until all 5 ranks are launched on n1 and results in it being oversubscribed.
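
    The two interpretations can be put side by side in a small model. This is a sketch in Python under the rules stated in this answer (each occurrence of a host name in `--host` counts as one slot; without `-np`, one rank is started per slot). The helper names `slots_from_host_list` and `launch_plan` are hypothetical and are not part of Open MPI.

```python
def slots_from_host_list(host_arg):
    """Turn a --host value such as "n1,n1,n2" into {"n1": 2, "n2": 1}."""
    slots = {}
    for name in host_arg.split(","):
        name = name.strip()
        if name:
            # Each occurrence of a host name adds one slot on that host.
            slots[name] = slots.get(name, 0) + 1
    return slots


def launch_plan(host_arg, np=None):
    """Return the slot table, the rank count, and the MPI-level oversubscription flag."""
    slots = slots_from_host_list(host_arg)
    total_slots = sum(slots.values())
    # Without -np, one rank is started per defined slot.
    ranks = total_slots if np is None else np
    return {"slots": slots, "ranks": ranks, "oversubscribed": ranks > total_slots}


if __name__ == "__main__":
    print(launch_plan("n1,n1,n1,n1,n1"))
    # {'slots': {'n1': 5}, 'ranks': 5, 'oversubscribed': False}
    print(launch_plan("n1", np=5))
    # {'slots': {'n1': 1}, 'ranks': 5, 'oversubscribed': True}
```

    Both calls end up with 5 ranks on n1; only the second one exceeds the number of declared slots.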

    Note that the node is oversubscribed only from the point of view of the MPI library: one slot was provided on n1, but 5 ranks had to be started there. This says nothing about whether the hardware itself is oversubscribed, i.e. the node could have many more than 5 free CPU cores.
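
    The distinction between the two notions of oversubscription can be made concrete with a small comparison. This is an illustrative Python sketch, not something Open MPI exposes: `oversubscription_views` is a made-up name, and `os.cpu_count()` is used only as a stand-in for "CPUs on the node" (it reports logical CPUs of the local machine, not free cores).

```python
import os


def oversubscription_views(ranks, slots, cores=None):
    """Compare the rank count with the declared slots and with the CPU count."""
    if cores is None:
        # Assumption: use the local logical CPU count; cpu_count() may return None.
        cores = os.cpu_count() or 1
    return {
        # What the MPI library sees: more ranks than slots it was given.
        "mpi_oversubscribed": ranks > slots,
        # What the hardware sees: more ranks than CPUs to run them on.
        "hardware_oversubscribed": ranks > cores,
    }


if __name__ == "__main__":
    # --host n1 -np 5 on an 8-core node: 1 slot, 5 ranks, 8 cores.
    print(oversubscription_views(ranks=5, slots=1, cores=8))
    # {'mpi_oversubscribed': True, 'hardware_oversubscribed': False}

    # --host n1,n1,n1,n1,n1 on the 4-core node from the question.
    print(oversubscription_views(ranks=5, slots=5, cores=4))
    # {'mpi_oversubscribed': False, 'hardware_oversubscribed': True}
```

    On the 4-core machine from the question, both command lines put 5 ranks on 4 cores; they differ only in whether the MPI library considers its slots exceeded.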

    When the host list is provided by the resource manager, oversubscribing the nodes is a very bad idea, especially since some or all nodes might be shared with other jobs. In that case it is recommended that the --nooversubscribe option is used in order to prevent mpiexec from launching more ranks than slots granted. Note though that there are legitimate cases of oversubscription, e.g. when the nodes are granted exclusively (no sharing with other jobs) and the MPI job is I/O intensive.