I am trying to set up GPUDirect so that I can use InfiniBand verbs RDMA calls directly on device memory, without needing cudaMemcpy. I have two machines, each with an NVIDIA K80 GPU card and driver version 367.27. CUDA 8 and Mellanox OFED 3.4 are installed. The Mellanox-NVIDIA GPUDirect plugin is also installed:
-bash-4.2$ service nv_peer_mem status
nv_peer_mem module is loaded.
According to the thread "How to use GPUDirect RDMA with Infiniband", I meet all the requirements for GPUDirect and the following code should run successfully. It does not: ibv_reg_mr fails with the error "Bad Address", as if GPUDirect were not properly installed.
void * gpu_buffer;
struct ibv_mr *mr;
const int size = 64*1024;
cudaMalloc(&gpu_buffer,size); // TODO: Check errors
mr = ibv_reg_mr(pd,gpu_buffer,size,IBV_ACCESS_LOCAL_WRITE|IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_REMOTE_READ);
Requested Info:
mlx5 is used.
Last Kernel log:
[Nov14 09:49] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 4430): umem get failed (-14)
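The -14 in the kernel log and the "Bad Address" error from ibv_reg_mr appear to be the same failure: on Linux, 14 is EFAULT, whose message is "Bad address". A minimal sketch in C that makes the mapping explicit, assuming Linux errno numbering (the helper name kernel_rc_to_message is hypothetical, not part of any library):

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: kernel code reports failures as negated errno
 * values (e.g. -14), while user-space calls such as ibv_reg_mr report
 * them through the positive errno. Normalize the sign and look up the
 * standard message for the code. */
const char *kernel_rc_to_message(int rc)
{
    return strerror(abs(rc));
}

/* Returns 1 if the code is EFAULT in either the kernel (-EFAULT) or
 * the user-space (EFAULT) sign convention, 0 otherwise. */
int is_bad_address(int rc)
{
    return abs(rc) == EFAULT;
}
```

Calling kernel_rc_to_message(-14) yields the same string that perror or strerror(errno) reports right after the failed ibv_reg_mr call, so the two diagnostics describe one failure rather than two separate problems.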
Am I missing something? Do I need some other packages, or do I have to activate GPUDirect in my code somehow?
A common reason for the nv_peer_mem module failing is its interaction with Unified Memory (UVM). Could you try disabling UVM?
export CUDA_DISABLE_UNIFIED_MEMORY=1
If this does not fix your problem, try running the validation and copybw tests from https://github.com/NVIDIA/gdrcopy to check GPUDirect RDMA. If they pass, GPUDirect RDMA itself works and the problem is most likely a misconfigured Mellanox stack.