How to measure the register bandwidth of an NVIDIA GPU?

I want to measure the register bandwidth of an NVIDIA GPU (OpenCL/CUDA). How can I do that?

I can't find any information about register bandwidth tests online, only bandwidth tests for the various cache levels.

Solution

Registers have zero clock cycles of access latency, and their bandwidth is tied to the GPU core clock frequency.

Say a GPU has 10 TFLOP/s of compute throughput with FP32 fused-multiply-add (FMA) instructions. Each FMA instruction performs 2 FLOPs, reads 3 FP32 inputs from registers, and writes 1 FP32 output to a register. Each FP32 number is 4 bytes. That makes 5 trillion FMA instructions per second, accessing 20 trillion FP32 numbers per second, for a combined register bandwidth of 80 TB/s.

So GPU register bandwidth is (FP32 TFLOP/s) × (8 bytes/FLOP), since each FMA moves 16 bytes for 2 FLOPs. This is valid for all GPUs.
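The arithmetic above can be written down as a small C++ sketch. The helper name `register_bandwidth_tbps` and the constants are hypothetical, introduced only for illustration; the function assumes the input is a measured or datasheet FP32 FMA throughput in TFLOP/s and simply applies the 8 bytes/FLOP factor derived above.

```cpp
// Register traffic per FP32 FMA: 3 source operands + 1 destination, 4 bytes each.
constexpr double kBytesPerFma = (3 + 1) * 4.0;  // 16 bytes
// Each FMA counts as 2 floating-point operations (multiply + add).
constexpr double kFlopsPerFma = 2.0;

// Hypothetical helper (not part of any library): converts FP32 FMA
// throughput in TFLOP/s into the implied register bandwidth in TB/s.
constexpr double register_bandwidth_tbps(double fp32_tflops) {
    // FMAs per second (in trillions) times bytes touched per FMA.
    return fp32_tflops / kFlopsPerFma * kBytesPerFma;
}
```

With the example figure from above, `register_bandwidth_tbps(10.0)` evaluates to 80.0, matching the 80 TB/s result.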

You can measure the FP32 TFLOP/s with, for example, this OpenCL-Benchmark tool.
