Tags: mpi, openmpi, mpi4py

How do nodes communicate in Open MPI?


I am able to run an Open MPI job on multiple nodes over SSH. Everything looks good, but I realize I don't know much about what is really happening. So, how do nodes communicate in Open MPI? The processes are on multiple nodes, so it cannot be shared memory. It also does not seem to be TCP or UDP, because I have not configured any ports. Can anyone describe what happens when a message is sent and received between two processes on two different nodes? Thanks!


Solution

  • Open MPI is built on top of a framework of frameworks called the Modular Component Architecture (MCA). There are frameworks for different activities such as point-to-point communication, collective communication, parallel I/O, remote process launch, etc. Each framework is implemented as a set of components that provide different implementations of the same public interface.

    Whenever the services of a specific framework are requested for the first time, e.g., those of the Byte Transfer Layer (BTL) or the Matching Transport Layer (MTL), both of which transfer messages between the ranks, MCA enumerates the various components capable of fulfilling the request and tries to instantiate them. Some components have specific requirements of their own, e.g., they require specific hardware to be present, and fail to instantiate if those aren't met. All components that instantiate successfully are scored, and the one with the best score is chosen to carry out the request and other similar requests. Thus, Open MPI is able to adapt itself to different environments with very little configuration on the user's side.
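
    The candidate list can also be restricted explicitly from the command line, which is occasionally handy for debugging or benchmarking. A rough sketch (the component names depend on your build, ./my_app is just a placeholder for your MPI program, and the --mca mechanism itself is described further below):

    $ # Take the openib component out of the running; everything else still competes
    $ mpiexec --mca btl ^openib -np 4 ./my_app
    $ # Or give an explicit list; only these BTL components will be considered
    $ mpiexec --mca btl self,vader,tcp -np 4 ./my_app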

    For communication between different ranks, the BTL and MTL frameworks provide multiple implementations and the set depends heavily on the Open MPI version and how it was built. The ompi_info tool can be used to query the library configuration. This is an example from one of my machines:

    $ ompi_info | grep 'MCA [mb]tl'
                     MCA btl: openib (MCA v2.1.0, API v3.0.0, Component v2.1.1)
                     MCA btl: sm (MCA v2.1.0, API v3.0.0, Component v2.1.1)
                     MCA btl: tcp (MCA v2.1.0, API v3.0.0, Component v2.1.1)
                     MCA btl: vader (MCA v2.1.0, API v3.0.0, Component v2.1.1)
                     MCA btl: self (MCA v2.1.0, API v3.0.0, Component v2.1.1)
                     MCA mtl: psm (MCA v2.1.0, API v2.0.0, Component v2.1.1)
                     MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v2.1.1)
    

    The different components listed here are:

    • openib -- uses InfiniBand verbs to communicate over InfiniBand networks (one of the most widespread high-performance communication fabrics for clusters nowadays) and over other RDMA-capable networks such as iWARP or RoCE
    • sm -- uses shared memory to communicate on the same node
    • tcp -- uses TCP/IP to communicate over any network that provides a sockets interface
    • vader -- similarly to sm, provides shared memory communication on the same node
    • self -- provides efficient self-communication
    • psm -- uses the PSM library to communicate over networks derived from PathScale's InfiniBand variant, such as Intel Omni-Path (r.i.p.)
    • ofi -- alternative InfiniBand transport that uses OpenFabrics Interfaces (OFI) instead of verbs

    The first time rank A on hostA wants to talk to rank B on hostB, Open MPI will go through the list of modules. self only provides self-communication and will be excluded. sm and vader will get excluded since they only provide communication on the same node. If your cluster is not equipped with a high-performance network, the most likely candidate to remain is tcp, because there is literally no cluster node that doesn't have some kind of Ethernet connection to it.
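
    If you want to watch this selection process happen, you can turn up the verbosity of the BTL framework for a two-node run. This is only a sketch; hostA, hostB, and ./mpi_hello are placeholders:

    $ # Prints which BTL components are opened, which are excluded, and why
    $ mpiexec --host hostA,hostB -np 2 \
          --mca btl_base_verbose 100 ./mpi_hello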

    The tcp component probes all network interfaces that are up and notes their network addresses. It opens listening TCP ports on all of them and publishes this information in a central repository (usually managed by the mpiexec process used to launch the MPI program). When the MPI_Send call in rank A requests the services of tcp in order to send a message to rank B, the component looks up the information published by rank B and selects all IP addresses that are in any of the networks that hostA is part of. It then tries to create one or more TCP connections and, upon success, the messages start flowing.
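
    You can convince yourself that this is what happens by looking at the sockets on one of the nodes while a job is running. This is only a rough check, assuming a Linux host with the ss utility available; the process name ./mpi_hello is a placeholder and the port numbers will differ from run to run:

    $ # On hostA, while the MPI job is running
    $ ss -tlnp | grep mpi_hello     # listening sockets opened by the tcp component
    $ ss -tnp  | grep mpi_hello     # established connections to the peer ranks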

    In most cases, you do not need to configure anything and the tcp component Just Works™. Sometimes, though, it may need some additional configuration. For example, the default TCP port range may be blocked by a firewall and you may need to tell it to use a different one. Or it may select network interfaces that happen to be in the same network range but do not provide physical connectivity between the nodes; a typical case is the virtual interfaces created by various hypervisors or container services. In that case, you have to tell tcp to exclude those interfaces.
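
    For instance, assuming the physical interconnect is on eth0 (or on the 192.168.1.0/24 network) and the problematic virtual interfaces are docker0 and virbr0 (all of these names are examples, not necessarily what your system has), either of the following would do, using the --mca mechanism described in the next paragraph:

    $ # Whitelist only the interface/network that provides real connectivity
    $ mpiexec --mca btl_tcp_if_include eth0 --host hostA,hostB -np 2 ./mpi_hello
    $ mpiexec --mca btl_tcp_if_include 192.168.1.0/24 --host hostA,hostB -np 2 ./mpi_hello
    $ # Or blacklist the offending interfaces (keep lo excluded when overriding the default)
    $ mpiexec --mca btl_tcp_if_exclude lo,docker0,virbr0 --host hostA,hostB -np 2 ./mpi_hello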

    Configuring the various MCA components is done by passing MCA parameters with the --mca param_name param_value command-line argument of mpiexec. You may query the list of parameters that a given MCA component has, together with their default values, using ompi_info --param framework component. For example:

    $ ompi_info --param btl tcp
                     MCA btl: tcp (MCA v2.1.0, API v3.0.0, Component v2.1.1)
                 MCA btl tcp: ---------------------------------------------------
                 MCA btl tcp: parameter "btl_tcp_if_include" (current value: "",
                              data source: default, level: 1 user/basic, type:
                              string)
                              Comma-delimited list of devices and/or CIDR
                              notation of networks to use for MPI communication
                              (e.g., "eth0,192.168.0.0/16").  Mutually exclusive
                              with btl_tcp_if_exclude.
                 MCA btl tcp: parameter "btl_tcp_if_exclude" (current value:
                              "127.0.0.1/8,sppp", data source: default, level: 1
                              user/basic, type: string)
                              Comma-delimited list of devices and/or CIDR
                              notation of networks to NOT use for MPI
                              communication -- all devices not matching these
                              specifications will be used (e.g.,
                              "eth0,192.168.0.0/16").  If set to a non-default
                              value, it is mutually exclusive with
                              btl_tcp_if_include.
                 MCA btl tcp: parameter "btl_tcp_progress_thread" (current value:
                              "0", data source: default, level: 1 user/basic,
                              type: int)
    

    Parameters have different levels and by default ompi_info only shows parameters of level 1 (user/basic parameters). This can be changed with the --level N argument to show parameters up to level N. The levels go all the way up to 9 and those with higher levels are only required in very advanced cases, such as fine-tuning the library or debugging issues. For example, btl_tcp_port_min_v4 and btl_tcp_port_range_v4, which are used in tandem to specify the port range for TCP connections, are parameters of level 2 (user/detail).
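
    For example, to see those detail-level parameters and then pin the TCP BTL to a fixed port window (a sketch; the ports, hosts, and ./mpi_hello are placeholders you would adapt to your own firewall rules):

    $ ompi_info --param btl tcp --level 2
    $ # Let the tcp component use 100 ports starting at 10000 (i.e., 10000-10099)
    $ mpiexec --mca btl_tcp_port_min_v4 10000 \
              --mca btl_tcp_port_range_v4 100 \
              --host hostA,hostB -np 2 ./mpi_hello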