Tags: rdma, gpudirect

RDMA read and write data placement/visibility semantics


I am trying to get more details on the RDMA read and write semantics (especially data placement semantics) and I would like to confirm my understanding with the experts here.

  1. RDMA Read:

Would the data be available/seen in the local buffer once the RDMA READ completion is seen in the completion queue? Is the behavior the same if I am using GPUDirect RDMA and the local address maps to GPU memory, i.e. would the data be immediately available in GPU memory once the RDMA READ completion is seen in the completion queue? If it is not immediately available, what operation will ensure it?

  2. RDMA Write with Immediate (or RDMA Write + Send):

Can the remote host check for the presence of data in its memory after it has seen the immediate data in its receive queue? And is the expectation/behavior going to change if the write is to GPU memory (using GDR)?


Solution

  • RDMA Read: Would the data be available/seen in the local buffer once the RDMA READ completion is seen in the completion queue?

    Yes.

    Is the behavior the same if I am using GPUDirect RDMA and the local address maps to GPU memory?

    Not necessarily. It is possible that the NIC has sent the data towards the GPU but the GPU hasn't received it yet, while the RDMA READ completion has already arrived at the CPU. The root cause is PCIe ordering semantics, which allow writes to different destinations (CPU memory vs. GPU memory) to be reordered.

    If it is not immediately available, what operation will ensure it?

    To ensure the data has arrived in GPU memory, one may have the CPU set a flag in host memory after it sees the RDMA completion and poll on this flag from GPU code (see the sketch below). This works because the PCIe read issued by the GPU will "push" the NIC's DMA writes ahead of it, according to PCIe ordering semantics.
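
    A minimal sketch of that pattern, assuming an already-established RC QP whose RDMA READ targets GPU memory registered for GPUDirect RDMA; the names cq, gpu_buf, consume_after_flag and wait_for_read_then_release_gpu are illustrative, and error handling is omitted:

        // CPU polls the CQ, then sets a flag in mapped host memory;
        // a GPU kernel spins on that flag before touching the data.
        #include <infiniband/verbs.h>
        #include <cuda_runtime.h>

        __global__ void consume_after_flag(volatile unsigned int *flag,
                                           const unsigned char *gpu_buf)
        {
            // The GPU's PCIe read of the host-memory flag pushes the NIC's
            // earlier DMA writes to GPU memory ahead of it (PCIe ordering).
            while (*flag == 0) { }
            // gpu_buf now holds the RDMA READ payload; consume it here.
            (void)gpu_buf;
        }

        void wait_for_read_then_release_gpu(struct ibv_cq *cq,
                                            unsigned char *gpu_buf)
        {
            unsigned int *h_flag, *d_flag;
            // (older setups may need cudaSetDeviceFlags(cudaDeviceMapHost) first)
            cudaHostAlloc((void **)&h_flag, sizeof(*h_flag), cudaHostAllocMapped);
            *h_flag = 0;
            cudaHostGetDevicePointer((void **)&d_flag, h_flag, 0);

            consume_after_flag<<<1, 1>>>(d_flag, gpu_buf);  // kernel starts polling

            struct ibv_wc wc;
            while (ibv_poll_cq(cq, 1, &wc) == 0) { }        // wait for the CQE on the CPU
            // wc.status / wc.opcode should be checked here (IBV_WC_SUCCESS,
            // IBV_WC_RDMA_READ); error handling is omitted in this sketch.
            *h_flag = 1;                                    // release the GPU kernel

            cudaDeviceSynchronize();
            cudaFreeHost(h_flag);
        }

    The flag lives in mapped (zero-copy) host memory so that the GPU's poll is an actual PCIe read, which is what provides the ordering guarantee described above.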

  • RDMA Write with Immediate (or RDMA Write + Send): Can the remote host check for the presence of data in its memory after it has seen the immediate data in its receive queue? And is the expectation/behavior going to change if the write is to GPU memory (using GDR)?

    Yes, this works, but GDR suffers from the same issue as above, with writes arriving out of order to GPU memory as compared to CPU memory, again due to PCIe ordering semantics. The RNIC cannot control PCIe ordering and therefore cannot force the "desired" semantics in either case. A receiver-side sketch for the host-memory case follows.
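
    For the host-memory case, a receiver-side sketch, assuming a receive work request has already been posted for the incoming immediate and that buf is the registered target of the remote RDMA Write; names are illustrative and error handling is omitted:

        // Poll for the receive completion carrying the immediate data;
        // once it is seen, the written data is present in host memory.
        #include <stdint.h>
        #include <arpa/inet.h>
        #include <infiniband/verbs.h>

        void wait_for_write_with_imm(struct ibv_cq *cq, const unsigned char *buf)
        {
            struct ibv_wc wc;
            while (ibv_poll_cq(cq, 1, &wc) == 0) { }   // spin until the receive completes

            if (wc.status == IBV_WC_SUCCESS &&
                wc.opcode == IBV_WC_RECV_RDMA_WITH_IMM &&
                (wc.wc_flags & IBV_WC_WITH_IMM)) {
                uint32_t imm = ntohl(wc.imm_data);     // 32-bit tag chosen by the sender
                // buf now contains the data written by the remote RDMA Write.
                // If the write targeted GPU memory (GDR), add the host-flag /
                // GPU-poll step from the sketch above before GPU code reads it.
                (void)imm; (void)buf;
            }
        }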