
What is the fastest way to transfer a 256-byte block of data from one CUDA block to another?


What is the fastest way to transfer a 256-byte block of data from one CUDA block to another? Is there any way to do it faster than going through global memory?


Solution

  • In theory, on devices of compute capability >= 2.0, transfers between blocks through global memory could be very fast, because global memory transactions go through the L1 and L2 caches.

    However, the only way to safely transfer data between blocks is to launch those blocks in separate kernel invocations. You then lose the theoretical advantage I just described, as the caches are flushed between invocations.

    Within a given kernel invocation, you cannot know in which order your blocks will be launched, so a block cannot assume that the block producing its data has already run.

    Transferring data between blocks launched by separate kernel invocations is a common paradigm in CUDA, and if there is enough computational work to be done, the latency of the global memory transactions can be completely hidden.
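As a rough illustration of the two-invocation pattern described above, here is a minimal sketch. The names `produce`, `consume`, `pattern`, `run_two_kernel_transfer`, the grid size, and the "read from the next block" rule are all hypothetical choices made for this example, not something from the question. It assumes both kernels are launched on the default stream, so the second kernel only starts after the first has finished and its global-memory writes are visible.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed sizes: 256 bytes per chunk (from the question) = 64 four-byte words.
constexpr int kWordsPerChunk = 64;
constexpr int kNumBlocks     = 8;   // hypothetical grid size

// Hypothetical payload: a value that identifies the producing block and word.
__host__ __device__ inline unsigned int pattern(int block, int word)
{
    return static_cast<unsigned int>(block) * 1000u + static_cast<unsigned int>(word);
}

// Kernel 1: each block writes its own 256-byte chunk to its slot in global memory.
__global__ void produce(unsigned int* exchange)
{
    exchange[blockIdx.x * kWordsPerChunk + threadIdx.x] = pattern(blockIdx.x, threadIdx.x);
}

// Kernel 2: each block reads the chunk written by another block (here: the next one)
// into shared memory, then copies it out so the host can check it.
__global__ void consume(const unsigned int* exchange, unsigned int* out)
{
    __shared__ unsigned int chunk[kWordsPerChunk];
    const int src = (blockIdx.x + 1) % gridDim.x;
    chunk[threadIdx.x] = exchange[src * kWordsPerChunk + threadIdx.x];
    __syncthreads();
    out[blockIdx.x * kWordsPerChunk + threadIdx.x] = chunk[threadIdx.x];
}

// Host driver: returns true if every block received its neighbour's chunk.
bool run_two_kernel_transfer()
{
    const size_t words = static_cast<size_t>(kNumBlocks) * kWordsPerChunk;
    const size_t bytes = words * sizeof(unsigned int);

    unsigned int* d_exchange = nullptr;
    unsigned int* d_out = nullptr;
    if (cudaMalloc(&d_exchange, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_exchange); return false; }

    // Same (default) stream: consume starts only after produce has completed.
    produce<<<kNumBlocks, kWordsPerChunk>>>(d_exchange);
    consume<<<kNumBlocks, kWordsPerChunk>>>(d_exchange, d_out);

    // cudaMemcpy on the default stream waits for both kernels to finish.
    std::vector<unsigned int> h_out(words);
    const cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_exchange);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int b = 0; b < kNumBlocks; ++b) {
        const int src = (b + 1) % kNumBlocks;
        for (int w = 0; w < kWordsPerChunk; ++w) {
            if (h_out[b * kWordsPerChunk + w] != pattern(src, w)) return false;
        }
    }
    return true;
}
```

Calling `run_two_kernel_transfer()` from your own `main()` and checking the return value is enough to try it out. Using one thread per 4-byte word keeps each chunk's accesses contiguous; whether that is the fastest layout on a given device is something to measure rather than assume.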