cuda gpu opencl nvidia

How to measure the register bandwidth of an NVIDIA GPU?


I want to test the register bandwidth of an NVIDIA GPU (OpenCL/CUDA). How can I do that?

I can't find any information about register bandwidth tests on the Internet, only bandwidth tests for the various cache levels.


Solution

  • Registers have effectively 0 clock cycles of access latency and run at the GPU core clock frequency, so their bandwidth can be derived directly from the arithmetic throughput.

    Say a GPU has 10 TFLOPs/s of FP32 compute throughput with fused multiply-add (FMA) instructions. Each FMA instruction performs 2 FLOPs, reads 3 FP32 inputs from registers, and writes 1 FP32 output to a register. Each FP32 number is 4 bytes. That makes 5 trillion FMA instructions per second, accessing 20 trillion FP32 numbers per second, for a combined register bandwidth of 80 TB/s.

    So GPU register bandwidth is (FP32 TFLOPs/s) × (8 bytes/FLOP), because each FMA moves 16 bytes through the registers for 2 FLOPs. This is valid for all GPUs.

    You can measure the FP32 TFLOPs/s with, for example, this OpenCL-Benchmark tool.
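
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula, not a measurement: the function name `register_bandwidth_tb_per_s` and the constant names are made up for this example, and the input is assumed to be an FP32 FMA throughput in TFLOPs/s that you measured or took from a spec sheet.

```python
# Hypothetical helper illustrating the estimate from the answer above.
# Assumption: peak FP32 throughput is reached with FMA instructions only.

FLOPS_PER_FMA = 2              # one multiply + one add
REGISTER_ACCESSES_PER_FMA = 4  # 3 register reads + 1 register write
BYTES_PER_FP32 = 4


def register_bandwidth_tb_per_s(fp32_tflops: float) -> float:
    """Estimate combined register bandwidth in TB/s from FP32 TFLOPs/s."""
    # 4 accesses * 4 bytes / 2 FLOPs = 8 bytes per FLOP
    bytes_per_flop = REGISTER_ACCESSES_PER_FMA * BYTES_PER_FP32 / FLOPS_PER_FMA
    return fp32_tflops * bytes_per_flop


if __name__ == "__main__":
    # The 10 TFLOPs/s example from the answer
    print(register_bandwidth_tb_per_s(10.0))  # prints 80.0
```

To apply it to a real GPU, replace `10.0` with the FP32 figure reported by your benchmark run.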