
In Infiniband, what is mapped into the PCIe BAR: the internal buffer of the Infiniband card, or the remote computer's RAM?


As we know, Infiniband allows RDMA - direct access to the memory of the remote computer.

It is also known that PCI Express (endpoint) devices, including Infiniband PCIe cards, are able to communicate in two ways:

  • I/O ports (in/out instructions) - deprecated
  • MMIO (BAR - memory-mapped I/O, accessed with mov)

But what exactly is mapped into the BAR (MMIO) when using an Infiniband PCIe card:

  • The Infiniband card's own internal buffer memory?
  • A part of the remote computer's RAM (the region that is currently being copied to or from with RDMA)?

So when I use Infiniband, what is mapped into the PCIe BAR: the internal buffer of the Infiniband card, or the remote computer's RAM?


Solution

  • The exact contents and usage of the MMIO space for an HCA are going to be vendor- and maybe card-specific. It does seem like a simple approach to implementing RDMA would be to have the card and its driver set up an MMIO region directly corresponding to the remotely mapped memory. However, thinking of that MMIO region as being one and the same as the remote memory is only true at a certain level of abstraction, probably at the Infiniband verbs library layer or above.

    Below that level, the protocol stack traversed by an RDMA write posted by an application on host A destined for host B will look something like the following (a verbs-level sketch of posting such a write appears after the list):

    1. Code executing on A does a mov into an MMIO address.
    2. The memory management hardware on A's CPU socket recognizes the address as PCIe MMIO and sends the request either directly out of the CPU socket to PCIe if the HCA is plugged into the appropriate slot, or to the southbridge as a DMI packet or similar, and then in turn over PCIe to the HCA.
    3. The HCA unwraps the RDMA payload from the PCIe packet and does whatever it needs to do to handle sending that information out to the Infiniband fabric. Most likely this involves buffering the payload in some small memory on the HCA itself for a short time before sending the RDMA request out in an IB packet.
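
    For concreteness, here is a minimal, hypothetical sketch (using the libibverbs C API) of what posting such an RDMA write looks like from the application's side on A. The function and variable names are illustrative, and the queue pair, memory registration and the peer's remote address/rkey are assumed to have been exchanged during connection setup:

        #include <infiniband/verbs.h>
        #include <stdint.h>
        #include <string.h>

        /* Illustrative sketch: post a one-sided RDMA write on host A.
         * qp, mr, remote_addr and rkey are assumed to come from earlier
         * connection setup; nothing here is tied to a particular vendor. */
        static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                                   void *local_buf, uint32_t len,
                                   uint64_t remote_addr, uint32_t rkey)
        {
            struct ibv_sge sge;
            struct ibv_send_wr wr, *bad_wr = NULL;

            memset(&sge, 0, sizeof(sge));
            sge.addr   = (uintptr_t)local_buf;   /* local source buffer   */
            sge.length = len;
            sge.lkey   = mr->lkey;               /* key from ibv_reg_mr() */

            memset(&wr, 0, sizeof(wr));
            wr.wr_id               = 1;
            wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided write      */
            wr.sg_list             = &sge;
            wr.num_sge             = 1;
            wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion */
            wr.wr.rdma.remote_addr = remote_addr;        /* B's registered buffer */
            wr.wr.rdma.rkey        = rkey;               /* B's rkey              */

            /* Posting the work request is where step 1 happens under the
             * hood: the provider library queues the request and then "rings
             * the doorbell" with a store (mov) into the HCA's BAR so the
             * card knows there is work to fetch and send out to the fabric. */
            return ibv_post_send(qp, &wr, &bad_wr);
        }

    Note that the application never touches the BAR directly here; the mov into MMIO is performed by the verbs library and its provider driver on the application's behalf when the work request is posted.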

    After traversing the IB fabric, the payload follows a roughly reverse series of steps on B:

    1. B's HCA receives the IB packet and unwraps the payload.
    2. The HCA possibly buffers the payload on the card itself for a short time while constructing a PCIe packet.
    3. The PCIe packet traverses B's motherboard, possibly getting translated into DMI or other formats along the way, before arriving at B's DMA controller.
    4. B's DMA controller arbitrates writing the payload to the region of B's system memory pinned for such RDMA transactions.

    The key step that makes RDMA faster than competing technologies is B4. Before any RDMA reads or writes took place, the Infiniband driver and verbs library on B set up a region of memory so that B's DMA controller would be able to safely write to it without any further context switches or OS or driver handling. That means that besides whatever latency was added by the wire transmission (usually on the order of microseconds), processing by the receiving end adds very little latency to the transaction. This step and the corresponding MMIO mapping on A also allow for zero-copy transfers, where neither side has to copy the memory of interest between kernel and user space to intermediate between the HCA, driver and application.
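
    As a rough illustration of that setup step, here is a hypothetical sketch (again using the libibverbs C API, with illustrative names) of how B might allocate and register the memory region that its HCA will later DMA into:

        #include <infiniband/verbs.h>
        #include <stdlib.h>

        /* Illustrative sketch: on host B, allocate a buffer and register it
         * so the HCA can DMA remote writes into it without further OS or
         * driver involvement. pd is a previously allocated protection domain. */
        static struct ibv_mr *setup_rdma_target(struct ibv_pd *pd, size_t len,
                                                void **buf_out)
        {
            void *buf = malloc(len);
            if (!buf)
                return NULL;

            /* ibv_reg_mr() pins the pages (so they cannot be paged out) and
             * programs the HCA with a translation for them. The access flags
             * grant the remote peer permission to write into this region. */
            struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                           IBV_ACCESS_LOCAL_WRITE |
                                           IBV_ACCESS_REMOTE_WRITE);
            if (!mr) {
                free(buf);
                return NULL;
            }

            /* B then communicates mr->rkey and the buffer's address to A
             * (for example over an ordinary send/receive exchange) so that
             * A can name this region in its RDMA write work requests. */
            *buf_out = buf;
            return mr;
        }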

    Here's a good link describing how this looks at a software level if you're calling IB verbs:

    http://thegeekinthecorner.wordpress.com/2010/09/28/rdma-read-and-write-with-ib-verbs/