
CUDA coalesced access of FP64 data


I am a bit confused about how the memory accesses issued by a warp are affected by FP64 data.

  • A warp always consists of 32 threads, regardless of whether these threads are doing FP32 or FP64 calculations. Right?
  • I have read that each time a thread in a warp reads from or writes to global memory, the warp accesses 128 bytes (32 single-precision floats). Right?
  • So if all the threads in a warp read different single-precision floats (a total of 128 bytes) from memory in a coalesced manner, the warp will issue a single memory transaction. Right?

Here is my question now:

  • What if all threads in the warp try to access different double-precision floats (a total of 256 bytes) in a coalesced manner? Will the warp issue two memory transactions (128+128)?

PS: I am mostly interested in Compute Capability 2.0+ architectures


Solution

  • A warp always consists of 32 threads, regardless of whether these threads are doing FP32 or FP64 calculations. Right?

    Correct.

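  • I have read that each time a thread in a warp reads from or writes to global memory, the warp accesses 128 bytes (32 single-precision floats). Right?

    Not exactly. Transactions are not always 128 bytes; there are also 32-byte transactions. On Compute Capability 2.x, for example, accesses cached in L1 are serviced with 128-byte transactions, while accesses cached only in L2 are serviced with 32-byte transactions.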

  • So if all the threads in a warp read different single-precision floats (a total of 128 bytes) from memory in a coalesced manner, the warp will issue a single memory transaction. Right?

    Correct.

  • What if all threads in the warp try to access different double-precision floats (a total of 256 bytes) in a coalesced manner? Will the warp issue two memory transactions (128+128)?

    Yes. The compiler emits a 64-bit load instruction, which is serviced by two 128-byte transactions per warp (32 threads × 8 bytes = 256 bytes) when the access can be coalesced.
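
The arithmetic behind these answers can be sketched as a small host-side counting model. This is only an illustration, not a CUDA API and not a description of how any particular GPU schedules memory traffic: `segments_touched` and `kWarpSize` are hypothetical names, and the model simply assumes that a warp's 32 threads read consecutive elements and that every aligned segment they touch costs one transaction.

```cpp
#include <cstddef>

// Assumption: one warp is 32 threads, each reading one element from
// consecutive addresses (a fully coalesced pattern).
constexpr std::size_t kWarpSize = 32;

// Hypothetical helper: counts the aligned segments of `segment_bytes`
// that contain at least one byte of the warp's request.
// `base` is the byte address of the first thread's element.
constexpr std::size_t segments_touched(std::size_t base,
                                       std::size_t elem_bytes,
                                       std::size_t segment_bytes) {
    // Index of the segment holding the last requested byte, minus the
    // index of the segment holding the first requested byte, plus one.
    return (base + kWarpSize * elem_bytes - 1) / segment_bytes
         - base / segment_bytes + 1;
}

// 32 floats = 128 bytes: one 128-byte segment when aligned.
static_assert(segments_touched(0, sizeof(float), 128) == 1, "float case");

// 32 doubles = 256 bytes: two 128-byte segments when aligned.
static_assert(segments_touched(0, sizeof(double), 128) == 2, "double case");

// The same 256 bytes counted in 32-byte segments.
static_assert(segments_touched(0, sizeof(double), 32) == 8, "32-byte case");

// A base address that is not 128-byte aligned spills into a third segment.
static_assert(segments_touched(64, sizeof(double), 128) == 3, "misaligned");
```

The checks are `static_assert`s, so compiling the snippet is enough to confirm the counts. Under this model the FP64 case needs twice as many segments as the FP32 case simply because each element is twice as wide, which matches the 128+128 split described above.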