
Can anybody please explain to me the relationship between libibverbs and librxe?


I am struggling to understand the relationship between libibverbs and librxe and the low-level kernel driver for the HCA.

Specifically, I have the following questions:

  • When a packet arrives on the HCA, the low-level kernel driver passes the packet to the userspace application. There is a memory copy involved here. In this picture, where do libibverbs and librxe sit?
  • Similarly, a send command issued by the user must be able to talk directly to the hardware via the low-level driver. Why are the userspace libraries needed in this case?

Solution

  • The InfiniBand verbs implementation consists of roughly four components:

    • a vendor-specific kernel module (e.g. ib_mthca for Mellanox devices)
    • a kernel module that allows verbs access from userspace (ib_uverbs)
    • a user-space vendor driver library (e.g. libmthca)
    • a glue component between the previous two (libibverbs)
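
    To make the layering concrete, here is a toy Python model of how a generic library can hand a device over to a vendor-specific provider. It is only a sketch: the real stack is written in C, and the names ToyProvider, ToyVerbs, register_provider, open_device and post_send are hypothetical stand-ins, not rdma-core APIs.

```python
# Toy model of the layering only; none of these names are real rdma-core APIs.


class ToyProvider:
    """Stands in for a user-space vendor library such as libmthca or librxe."""

    def __init__(self, driver_name):
        self.driver_name = driver_name

    def post_send(self, payload):
        # A real provider would build a device-specific work request here.
        return f"{self.driver_name}: queued {len(payload)} bytes"


class ToyVerbs:
    """Stands in for libibverbs: one API in front of many providers."""

    def __init__(self):
        self._providers = {}

    def register_provider(self, provider):
        self._providers[provider.driver_name] = provider

    def open_device(self, kernel_driver_name):
        # Assumption for this sketch: the provider is chosen by matching
        # the name of the kernel driver that owns the device.
        return self._providers[kernel_driver_name]


verbs = ToyVerbs()
verbs.register_provider(ToyProvider("mthca"))
verbs.register_provider(ToyProvider("rxe"))

device = verbs.open_device("rxe")
result = device.post_send(b"hello")
print(result)
```

    The application only ever talks to the generic layer; which provider does the actual work depends on the device that was opened.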

    InfiniBand generally supports two semantics: packet-based (send/receive) operation and remote DMA. In both modes of operation, the HCA implements zero-copy by reading from and writing to the application buffer(s) directly. This is done (as already explained by haggai_e) by pinning the buffer in physical memory as part of what is called registering, which prevents the virtual memory manager from swapping it out to disk or moving it around in physical RAM. A very nice feature of InfiniBand is that each HCA has its own virtual-to-physical address translation engine, which allows one to pass userspace pointers directly to the hardware.
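
    The effect of registration can be pictured with a small Python model. This is a sketch under loud assumptions: ToyHca, its register and translate methods, the 4096-byte page size and the example page table are all made up for illustration, and a real HCA does this in hardware.

```python
PAGE = 4096  # assumed page size for this sketch


class ToyHca:
    """Hypothetical model of an HCA's private address translation table."""

    def __init__(self):
        self.translation = {}  # virtual page number -> physical page number
        self.regions = {}      # key -> (virtual start address, length)
        self.next_key = 1

    def register(self, virt_addr, length, page_table):
        # "Pinning": the mappings are copied once and assumed not to change
        # afterwards, because the pages may no longer be swapped or moved.
        first = virt_addr // PAGE
        last = (virt_addr + length - 1) // PAGE
        for vpn in range(first, last + 1):
            self.translation[vpn] = page_table[vpn]
        key = self.next_key
        self.next_key += 1
        self.regions[key] = (virt_addr, length)
        return key

    def translate(self, key, virt_addr):
        # The device resolves a userspace pointer without asking the kernel.
        start, length = self.regions[key]
        if not (start <= virt_addr < start + length):
            raise ValueError("address outside registered region")
        return self.translation[virt_addr // PAGE] * PAGE + virt_addr % PAGE


# Hypothetical process page table: virtual pages 16 and 17 are resident.
page_table = {16: 7, 17: 3}

hca = ToyHca()
buf_addr = 16 * PAGE + 100
key = hca.register(buf_addr, 5000, page_table)

phys_first = hca.translate(key, buf_addr)
phys_second = hca.translate(key, 17 * PAGE + 4)
```

    Once the region is registered, later work requests can carry the plain virtual address plus the key, and the translation happens on the device side.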

    The reason for having a user-level driver is that verbs exposes the HCA's hardware registers directly to userspace, and each HCA has a different set of registers, hence the need for an intermediate vendor-specific userspace layer. Of course, this could be implemented entirely in the kernel, with a single vendor-independent userspace library on top, but InfiniBand tries very hard to provide the lowest possible latency, and going through the kernel for every operation would be very expensive. Because RDMA devices can translate virtual addresses on their own, the userspace library does not have to go through the kernel to obtain the physical address of the buffer when creating entries in the work queues (part of the mechanism used by verbs to send and receive data).

    Note that there are basically two vendor libraries - one in the kernel and one in userspace. The former provides verbs functionality to other kernel modules like file systems (e.g. Lustre) or network protocol drivers (e.g. IP-over-InfiniBand), while the latter provides that functionality in userspace. Some operations cannot be done entirely in userspace, e.g. registering memory or opening/closing device contexts, and those are transparently passed to the kernel module by libibverbs.
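
    The split between setup operations that cross into the kernel and data-path operations that stay in userspace can be sketched with a toy Python model. The classes ToyKernel and ToyUserDriver and their methods are hypothetical, chosen only to count how often each path is taken; they do not reproduce the real libibverbs calls.

```python
class ToyKernel:
    """Counts how often userspace has to cross into the kernel."""

    def __init__(self):
        self.crossings = 0

    def command(self, name):
        self.crossings += 1
        return f"kernel handled {name}"


class ToyUserDriver:
    """Hypothetical user-space vendor driver with a slow and a fast path."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.send_queue = []
        self.doorbell_rings = 0

    def reg_mr(self, buf):
        # Slow path: registering memory needs the kernel (pinning pages).
        self.kernel.command("reg_mr")
        return {"buffer": buf}

    def post_send(self, mr):
        # Fast path: write a work queue entry and ring the doorbell,
        # modelled here as plain userspace operations with no kernel call.
        self.send_queue.append(mr["buffer"])
        self.doorbell_rings += 1


kernel = ToyKernel()
driver = ToyUserDriver(kernel)

mr = driver.reg_mr(bytearray(64))  # one-time setup
for _ in range(1000):              # data path
    driver.post_send(mr)

print(kernel.crossings, driver.doorbell_rings)
```

    In this model a thousand sends cost a single kernel crossing, the one made at registration time, which is the latency argument for keeping the vendor-specific data path in userspace.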

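    Although RDMA over Converged Ethernet (RoCE) is technically not InfiniBand at the hardware level, the OpenFabrics stack is designed to support RDMA-capable hardware other than InfiniBand HCAs, including RoCE and iWARP adapters. librxe fits into this picture as the userspace vendor library for Soft-RoCE (rxe), a software implementation of RoCE that runs over ordinary Ethernet NICs, so it plays the same role for rxe that libmthca plays for Mellanox HCAs.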

    See this summary from Intel on the topic of accessing InfiniBand on Linux for more details.