cuda infiniband gpudirect

Setting up GPUDirect for InfiniBand


I am trying to set up GPUDirect so that I can use InfiniBand verbs RDMA calls directly on device memory, without needing cudaMemcpy. I have two machines, each with NVIDIA K80 GPU cards and driver version 367.27. CUDA 8 and Mellanox OFED 3.4 are installed, as is the Mellanox-NVIDIA GPUDirect plugin:

-bash-4.2$ service nv_peer_mem status
nv_peer_mem module is loaded.
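
The same status check can be done from a program by reading /proc/modules, where each line begins with a module name. The following is a minimal C sketch; the function names module_listed and nv_peer_mem_loaded are hypothetical helpers, not part of any library. A loaded module only shows that the plugin is present, not that memory registration will succeed.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `name` is the first whitespace-separated field of any line
 * in a /proc/modules-style stream, 0 otherwise. */
int module_listed(FILE *modules, const char *name)
{
    char line[512];
    size_t len = strlen(name);

    while (fgets(line, sizeof line, modules) != NULL) {
        /* Require a separator after the name so that a longer module
         * name sharing the same prefix does not match. */
        if (strncmp(line, name, len) == 0 &&
            (line[len] == ' ' || line[len] == '\t'))
            return 1;
    }
    return 0;
}

/* Convenience wrapper for the real file.
 * Returns 1 if loaded, 0 if not, -1 if /proc/modules cannot be opened. */
int nv_peer_mem_loaded(void)
{
    FILE *f = fopen("/proc/modules", "r");
    int found;

    if (f == NULL)
        return -1;
    found = module_listed(f, "nv_peer_mem");
    fclose(f);
    return found;
}
```

Calling nv_peer_mem_loaded() before ibv_reg_mr lets a program report a missing module explicitly instead of only surfacing the registration error.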

According to the thread "How to use GPUDirect RDMA with Infiniband", I meet all the requirements for GPUDirect, so the following code should run successfully. It does not: ibv_reg_mr fails with the error "Bad address", as if GPUDirect were not properly installed.

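void *gpu_buffer = NULL;
struct ibv_mr *mr;
const size_t size = 64 * 1024;
/* Allocate the buffer in device memory. */
if (cudaMalloc(&gpu_buffer, size) != cudaSuccess)
    exit(EXIT_FAILURE);
/* Register the device pointer with the protection domain pd. */
mr = ibv_reg_mr(pd, gpu_buffer, size,
                IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);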

Requested Info:
The mlx5 driver is used.
Last kernel log entry:

[Nov14 09:49] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 4430): umem get failed (-14)
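
The -14 in the kernel log line is a negated errno value. On Linux, errno 14 is EFAULT, whose message is "Bad address", so the kernel warning and the ibv_reg_mr error describe the same failure. A minimal C sketch of that mapping follows; the helper name kernel_code_message is hypothetical and exists only for illustration.

```c
#include <errno.h>
#include <string.h>

/* Kernel logs report failures as negated errno values, e.g. "(-14)".
 * Return the user-space message for such a code by negating it back
 * and passing it to strerror(). */
const char *kernel_code_message(int kernel_code)
{
    return strerror(-kernel_code);
}
```

With this helper, kernel_code_message(-14) yields the same text that perror or strerror(errno) reports after the failed registration.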

Am I missing something? Do I need other packages, or do I have to enable GPUDirect in my code somehow?


Solution

  • A common reason for the nv_peer_mem module failing is interaction with Unified Memory (UVM). Could you try disabling UVM?

    export CUDA_DISABLE_UNIFIED_MEMORY=1

    If this does not fix your problem, try running the validation and copybw tests from https://github.com/NVIDIA/gdrcopy to check GPUDirect RDMA. If they pass, then your Mellanox stack is misconfigured.