InfiniBand RDMA poor transfer bandwidth

In my application I use an InfiniBand infrastructure to send a stream of data from one server to another. To ease development I have been using IP over InfiniBand, because I'm more familiar with socket programming. Until now the performance (max bandwidth) was good enough for me (I knew I wasn't getting the maximum achievable bandwidth), but now I need to get more bandwidth out of that InfiniBand connection.

ib_write_bw reports that my maximum achievable bandwidth is around 1500 MB/s (I'm not getting 3000 MB/s because my card is installed in a PCIe 2.0 x8 slot).

So far so good. I coded my communication channel using ibverbs and RDMA, but I'm getting far less than the bandwidth I should be able to get. I'm even getting a bit less bandwidth than with sockets, although at least my application doesn't use any CPU:

ib_write_bw: 1500 MB/s

sockets: 700 MB/s <= One core of my system is at 100% during this test

ibverbs+rdma: 600 MB/s <= No CPU is used at all during this test

It seems that the bottleneck is here:

ibv_sge sge;
sge.addr = (uintptr_t)memory_to_transfer;
sge.length = memory_to_transfer_size;
sge.lkey = memory_to_transfer_mr->lkey;

ibv_send_wr wr;
memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.opcode = IBV_WR_RDMA_WRITE;
wr.sg_list = &sge;
wr.num_sge = 1;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = (uintptr_t)thePeerMemoryRegion.addr;
wr.wr.rdma.rkey = thePeerMemoryRegion.rkey;

ibv_send_wr *bad_wr = NULL;
if (ibv_post_send(theCommunicationIdentifier->qp, &wr, &bad_wr) != 0) {
  notifyError("Unable to ibv post send");
}

At this point the following code waits for the completion:

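// Wait for completion
ibv_cq *cq;
void* cq_context;
if (ibv_get_cq_event(theCompletionEventChannel, &cq, &cq_context) != 0) {
  notifyError("Unable to get a ibv cq event");
}

ibv_ack_cq_events(cq, 1);

// Re-arm the CQ before polling so that no completion is missed
if (ibv_req_notify_cq(cq, 0) != 0) {
  notifyError("Unable to get a req notify");
}

ibv_wc wc;
int myRet = ibv_poll_cq(cq, 1, &wc);
if (myRet < 0) {
  notifyError("Unable to poll the cq");
} else if (myRet == 1 && wc.status != IBV_WC_SUCCESS) {
  // With num_entries = 1 at most one ibv_wc is returned, so check its status
  notifyError(ibv_wc_status_str(wc.status));
}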

The time between my ibv_post_send and the moment ibv_get_cq_event returns an event is 13.3 ms when transferring chunks of 8 MB, which works out to around 600 MB/s.
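As a quick sanity check on these figures, here is a minimal sketch of the arithmetic. The helper names are hypothetical, and 1 MB is assumed to mean 10^6 bytes:

```cpp
#include <cassert>

// Hypothetical helper: throughput in MB/s (assuming 1 MB = 10^6 bytes)
// for a transfer of `bytes` bytes that took `milliseconds` ms.
inline double throughputMBps(double bytes, double milliseconds) {
    assert(milliseconds > 0.0);
    return (bytes / 1e6) / (milliseconds / 1e3);
}

// Hypothetical helper: time in ms needed to move `bytes` bytes
// at a sustained rate of `mbps` MB/s.
inline double transferTimeMs(double bytes, double mbps) {
    assert(mbps > 0.0);
    return (bytes / 1e6) / mbps * 1e3;
}
```

With these, throughputMBps(8e6, 13.3) comes out at roughly 600 MB/s, matching the measurement, while transferTimeMs(8e6, 1500.0) suggests the same chunk would need only about 5.3 ms at the rate ib_write_bw reports.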

To be more specific, here is what I do overall (in pseudocode):

Active Side:

post a message receive
rdma connection
wait for rdma connection event
<<at this point transfer tx flow starts>>
register memory containing bytes to transfer
wait for remote memory region addr/key (I wait for an ibv_wc)
send data with ibv_post_send
post a message receive
wait for ibv_post_send event (I wait for an ibv_wc) (this lasts 13.3 ms)
send message "DONE"
unregister memory
goto start

Passive Side:

post a message receive
rdma accept
wait for rdma connection event
<<at this point transfer rx flow starts>>
register memory that has to receive the bytes
send addr/key of memory registered
wait for "DONE" message
unregister memory
post a message receive
goto start

Does anyone know what I'm doing wrong, or what I can improve? I'm not affected by "Not Invented Here" syndrome, so I'm even open to throwing away what I have done so far and adopting something else. I only need a point-to-point contiguous transfer.


  • I solved the issue by allocating the buffers to be transmitted aligned to the page size. On my system the page size is 4K (the value returned by sysconf(_SC_PAGESIZE)). With that change I can now reach around 1400 MB/s, even though I still do the registration/unregistration.
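
A minimal sketch of such a page-aligned allocation, assuming a POSIX system. allocatePageAligned, isPageAligned and pageSize are hypothetical helper names, not part of ibverbs, and the 4096 fallback is an assumption. The returned buffer would then be registered as before:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free
#include <unistd.h>   // sysconf

// Hypothetical helper: system page size in bytes (4096 on the system above).
inline std::size_t pageSize() {
    long ps = sysconf(_SC_PAGESIZE);
    // Fall back to 4096 if sysconf fails (an assumption, not a guarantee).
    return ps > 0 ? static_cast<std::size_t>(ps) : 4096;
}

// Hypothetical helper: true if `p` starts exactly on a page boundary.
inline bool isPageAligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % pageSize() == 0;
}

// Hypothetical helper: allocates `size` bytes starting on a page boundary.
// Returns nullptr on failure; the buffer must be released with free().
inline void* allocatePageAligned(std::size_t size) {
    void* buffer = nullptr;
    if (posix_memalign(&buffer, pageSize(), size) != 0) {
        return nullptr;
    }
    assert(isPageAligned(buffer));
    return buffer;
}
```

For the 8 MB chunks in the question this would be called as allocatePageAligned(8 * 1024 * 1024), with the result used as memory_to_transfer and released with free() after the memory region is deregistered.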