Tags: cuda, mpi, infiniband, multi-gpu, mvapich2

CUDA-aware MPI for two GPUs within one K80


I am trying to optimize the performance of an MPI+CUDA benchmark called LAMMPS (https://github.com/lammps/lammps). Right now I am running with two MPI processes and two GPUs. My system has two sockets, and each socket connects to two K80s. Since each K80 contains two GPUs internally, each socket actually connects to four GPUs. However, I am only using two cores in one socket and the two GPUs (one K80) connected to that socket. The MPI library is MVAPICH2 2.2rc1 and the CUDA toolkit version is 7.5.

That is the background. I profiled the application and found that communication was the performance bottleneck, and I suspect that is because no GPUDirect technique was being used. So I switched to MVAPICH2-GDR 2.2rc1 and installed all the other required libraries and tools. But MVAPICH2-GDR requires an InfiniBand interface card, which is not available on my system, so I get the runtime error "channel initialization failed. No active HCAs found on the system". As I understand it, an InfiniBand card should not be required if we only want to use the two GPUs within one K80 on a single node, because the K80 has an internal PCIe switch connecting its two GPUs. These are my doubts. To make the questions clear, I list them as follows:

  1. In my system, one socket connects to two K80s. If the GPUs in one K80 need to communicate with the GPUs in the other K80, must we have an IB card in order to use GPUDirect?

  2. If we only use the two GPUs within one K80, then the communication between these two GPUs should not require an IB card, right? However, MVAPICH2-GDR requires at least one IB card. Is there any workaround for this, or do I have to install an IB card in the system?


Solution

  • In my system, one socket connects to two K80s. If the GPUs in one K80 need to communicate with the GPUs in the other K80, must we have an IB card in order to use GPUDirect?

    The only time an IB card is needed is when you have MPI communications (GPU or otherwise) that are going from system to system. GPUs in the same system do not need an IB card to be present in order to communicate with each other. More information about using GPUDirect in this (single-system) setting is below.

  • If we only use the two GPUs within one K80, then the communication between these two GPUs should not require an IB card, right? However, MVAPICH2-GDR requires at least one IB card. Is there any workaround for this, or do I have to install an IB card in the system?

    The GDR in MVAPICH2-GDR refers to GPUDirect-RDMA. GPUDirect is a general umbrella term for a set of technologies that allow GPUs to communicate directly with each other.

    For GPUs in the same system, the relevant GPUDirect technology is called Peer-to-Peer (P2P). The two GPUs on a K80 should always be able to communicate with each other using P2P, and you can validate this for yourself using the CUDA sample codes that have P2P in their name, such as simpleP2P. That sample will also tell you whether your system supports P2P between any two GPUs in the same system.
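    As a rough sketch (not the simpleP2P sample itself), this is how you can query and enable P2P access between two devices with the CUDA runtime API; device numbers 0 and 1 are assumed here to be the two GPUs of one K80:

        #include <stdio.h>
        #include <cuda_runtime.h>

        int main(void) {
            /* Check whether each GPU can directly access the other's memory
               (assumed to be devices 0 and 1, the two GPUs of one K80). */
            int can01 = 0, can10 = 0;
            cudaDeviceCanAccessPeer(&can01, 0, 1);
            cudaDeviceCanAccessPeer(&can10, 1, 0);
            printf("P2P 0->1: %s, 1->0: %s\n",
                   can01 ? "yes" : "no", can10 ? "yes" : "no");

            if (can01 && can10) {
                /* Enable P2P in both directions; cudaMemcpyPeer and kernels
                   can then touch the other GPU's memory directly over PCIe. */
                cudaSetDevice(0);
                cudaDeviceEnablePeerAccess(1, 0);
                cudaSetDevice(1);
                cudaDeviceEnablePeerAccess(0, 0);
            }
            return 0;
        }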

    For GPUs in separate systems that are connected by IB (Infiniband) networking, there is an additional GPUDirect technology called GPUDirect-RDMA. This allows two GPUs in separate systems to communicate with each other over the IB link.

    So, since MVAPICH2-GDR incorporates GPUDirect-RDMA, which depends on IB, it will probably look for an IB card by default.

    However, you should still be able to get a communication benefit from a GPUDirect-enabled MPI (including some flavors of MVAPICH2) even between GPUs in a single system, for example the two GPUs within a K80. This kind of usage is generally called "CUDA-aware MPI", because it uses GPUDirect P2P but not necessarily RDMA; a minimal sketch of what it looks like from the application side follows.
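    For illustration, here is a hedged sketch of a CUDA-aware MPI exchange, assuming two ranks mapped to the two GPUs of one K80 and an arbitrary buffer size (none of this is taken from LAMMPS). Each rank passes a device pointer straight to MPI, and a CUDA-aware build (for MVAPICH2, built with CUDA support and run with MV2_USE_CUDA=1) moves the data via P2P or host staging under the hood:

        #include <mpi.h>
        #include <cuda_runtime.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* Assumption: ranks 0 and 1 map to the two GPUs of one K80. */
            cudaSetDevice(rank);

            const int N = 1 << 20;                 /* arbitrary buffer size */
            double *d_buf;
            cudaMalloc((void **)&d_buf, N * sizeof(double));

            /* With a CUDA-aware MPI, the device pointer is passed directly;
               no explicit cudaMemcpy to a host staging buffer is needed. */
            if (rank == 0)
                MPI_Send(d_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(d_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);

            cudaFree(d_buf);
            MPI_Finalize();
            return 0;
        }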

    A detailed tutorial and walkthrough of how to set that up is beyond what I can offer in an SO answer, but for more information on this kind of usage I would refer you to two blog articles that cover the topic thoroughly: the first part is here and the second part is here. More information on GPUDirect-RDMA is here.