infiniband rdma mellanox

What is a Producer index (PI) in the context of ibv_exp_post_send and ibv_exp_post_task?


I am trying to use the Cross-Channel Communication support described in Appendix D to the RDMA Aware Programming User Manual. Unfortunately I am a bit confused as to the meanings of certain function arguments.

My Question

The ibv_exp_post_send() and ibv_exp_post_task() functions take a linked list of work request structs and a collection* of work request structs respectively. What is the meaning of cq_count and wqe_count in that struct?

struct ibv_exp_send_wr {
  // ...
  union {
    // ...          
    struct {
      struct ibv_cq   *cq; /* Completion queue (CQ) that WAIT WR relates to */
      int32_t  cq_count;   /* Producer index (PI) of the CQ */
    } cqe_wait; /* Describes IBV_EXP_WR_CQE_WAIT WR */
    struct {
      struct ibv_qp   *qp; /* Queue pair (QP) that SEND_EN/RECV_EN WR relates to */
      int32_t  wqe_count; /* Producer index (PI) of the QP */
    } wqe_enable; /* Describes IBV_EXP_WR_RECV_ENABLE and IBV_EXP_WR_SEND_ENABLE WR */
  } task;
  // ...
};

Is the first work request/completion always numbered 1, with subsequent ones increasing linearly? Or does the numbering sometimes reset, for example between ibv_exp_post_task() calls, or decrease after some requests have been handled? Are the numbers consistent between ibv_exp_post_send() and ibv_exp_post_task()?

*Technically, a pointer to a linked list of tasks, each of which contains a linked list of work requests.


Solution

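  • As far as I can see, how cq_count is interpreted depends on a flag that can be set in ibv_exp_send_wr.send_flags: IBV_EXP_SEND_WAIT_EN_LAST.

    cq_count is a relative offset: the completion to wait for is an internal per-CQ base index (wait_index in the code below) plus cq_count. While the flag is cleared, the base stays where it is and the driver only remembers the largest cq_count it has seen. When a WAIT request is posted with the flag set, the base advances by that largest value, so later offsets are counted from the new position.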

    Check out this code from the libmlx5 driver:

    case IBV_EXP_WR_CQE_WAIT:
    {
            struct mlx5_cq *wait_cq = to_mcq(wr->task.cqe_wait.cq);
            uint32_t wait_index = 0;
    
            wait_index = wait_cq->wait_index +
                            wr->task.cqe_wait.cq_count;
            wait_cq->wait_count = max(wait_cq->wait_count,
                            wr->task.cqe_wait.cq_count);
    
            if (exp_send_flags & IBV_EXP_SEND_WAIT_EN_LAST) {
                    wait_cq->wait_index += wait_cq->wait_count;
                    wait_cq->wait_count = 0;
            }
    
            set_wait_en_seg(seg, wait_cq->cqn, wait_index);
            seg   += sizeof(struct mlx5_wqe_wait_en_seg);
            size += sizeof(struct mlx5_wqe_wait_en_seg) / 16;
    }
    break;
    

    wqe_count behaves similarly, except that it also accepts a special value of zero, which asks the driver to enable all previously posted work requests.
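
To make the cq_count bookkeeping concrete, here is a minimal sketch that replays the arithmetic of the driver excerpt above outside of any RDMA library. The names sim_cq and sim_post_cqe_wait are hypothetical and exist only for illustration; the sketch also assumes cq_count is non-negative and does not model wqe_count or anything the hardware does with the resulting index.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two per-CQ fields used by the driver excerpt. */
struct sim_cq {
    uint32_t wait_index; /* base that cq_count is added to */
    uint32_t wait_count; /* largest cq_count seen since the base last moved */
};

/*
 * Replays the arithmetic of the IBV_EXP_WR_CQE_WAIT case.
 * Returns the absolute index that would be written into the WQE.
 * wait_en_last models IBV_EXP_SEND_WAIT_EN_LAST being set (non-zero) or cleared (0).
 * Assumption: cq_count >= 0.
 */
static uint32_t sim_post_cqe_wait(struct sim_cq *cq, int32_t cq_count,
                                  int wait_en_last)
{
    /* The target is always base + offset, whether or not the flag is set. */
    uint32_t wait_index = cq->wait_index + (uint32_t)cq_count;

    /* Remember the largest offset requested so far. */
    if ((uint32_t)cq_count > cq->wait_count)
        cq->wait_count = (uint32_t)cq_count;

    /* Only a request carrying the flag moves the base forward. */
    if (wait_en_last) {
        cq->wait_index += cq->wait_count;
        cq->wait_count = 0;
    }

    return wait_index;
}
```

Starting from a zeroed sim_cq, posting offsets 1 and 2 without the flag yields targets 1 and 2 and leaves the base at 0. Posting offset 3 with the flag yields target 3 and moves the base to 3, so a following request with offset 1 targets index 4. In this model the numbering therefore never resets or decreases on its own; it only moves forward when a flagged request is posted.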