CUDA memory performance: what is global memory bandwidth?


I am learning about CUDA optimizations. I found this presentation: Optimizing CUDA by Paulius Micikevicius.

In the section titled MAXIMIZE GLOBAL MEMORY BANDWIDTH, the presentation says that coalescing global memory accesses improves bandwidth.

My question: how do you calculate global memory bandwidth? Can anyone explain it with a simple example program?


Solution

  • Theoretical bandwidth can be calculated from the hardware specifications.

    For example, the NVIDIA GeForce GTX 280 uses DDR (double data rate) RAM with a memory clock rate of 1,107 MHz and a 512-bit wide memory interface. From these figures, the peak theoretical memory bandwidth of the GTX 280 is 141.6 GB/sec:

    ( 1107 × 10^6 × ( 512 / 8 ) × 2 ) / 10^9 = 141.6 GB/sec

    In this calculation, the memory clock rate is converted into Hz, multiplied by the interface width (divided by 8 to convert bits to bytes), and multiplied by 2 because of the double data rate. Finally, the product is divided by 10^9 to convert the result to GB/sec (GBps).
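    The same arithmetic can be written as a small host-side function. This is only a sketch: the name `theoreticalBandwidthGBps` and its parameters are illustrative, not part of any CUDA API. It contains no device code, so it builds in a `.cu` file with nvcc and needs no GPU to run. For the GTX 280 figures it prints 141.696 GB/s, which the answer truncates to 141.6.

```cuda
// Host-only sketch: peak theoretical DRAM bandwidth from the memory spec.
// No device code, so it builds with nvcc in a .cu file (or any C++ compiler).
#include <cassert>
#include <cstdio>

// memClockMHz : memory clock rate in MHz (1,107 for the GTX 280)
// busWidthBits: memory interface width in bits (512 for the GTX 280)
// dataRate    : transfers per clock cycle (2 for double data rate memory)
double theoreticalBandwidthGBps(double memClockMHz, int busWidthBits, int dataRate)
{
    assert(memClockMHz > 0.0 && busWidthBits > 0 && dataRate > 0);
    double clockHz       = memClockMHz * 1e6;         // MHz -> Hz
    double bytesPerClock = busWidthBits / 8.0;        // bits -> bytes
    return clockHz * bytesPerClock * dataRate / 1e9;  // bytes/s -> GB/s
}

int main()
{
    // GTX 280: 1107e6 Hz * 64 B * 2 / 1e9 = 141.696 GB/s
    std::printf("GTX 280 peak: %.3f GB/s\n",
                theoreticalBandwidthGBps(1107.0, 512, 2));
    return 0;
}
```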

  • Effective bandwidth is calculated by timing specific program activities and by knowing how data is accessed by the program. To do so, use this equation:

    Effective bandwidth = (( Br + Bw ) / 10^9 ) / time

    Here, the effective bandwidth is in units of GBps, Br is the number of bytes read per kernel, Bw is the number of bytes written per kernel, and time is given in seconds.
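    A minimal sketch of a program that applies this formula follows. These choices are assumptions, not part of the answer: the kernel names `copyCoalesced` and `copyStrided`, the array size of 2^24 floats, 256 threads per block, and the stride of 33 for the uncoalesced variant. Each kernel reads and writes every element exactly once, so Br = Bw = n × 4 bytes. CUDA events supply the elapsed time. Because 33 is odd and n is a power of two, `(i * 33) % n` is a permutation of the indices, so the byte count stays the same while neighboring threads in a warp touch addresses about 132 bytes apart. That is the kind of uncoalesced pattern the presentation warns about. Absolute numbers depend on the GPU, but you would typically expect the coalesced copy to report a clearly higher effective bandwidth than the strided one.

```cuda
// Sketch: effective bandwidth of a copy kernel, timed with CUDA events.
// Hypothetical choices: kernel names, n = 2^24 floats, 256 threads/block, stride 33.
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",        \
                         cudaGetErrorString(err_), __LINE__);         \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Coalesced: thread i copies element i, so a warp touches one contiguous range.
__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i copies element (i * stride) % n. With n a power of two and
// an odd stride, every element is visited once, but neighboring threads are
// typically stride * sizeof(float) bytes apart, scattering a warp's accesses.
__global__ void copyStrided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = static_cast<int>((static_cast<long long>(i) * stride) % n);
        out[j] = in[j];
    }
}

// Effective bandwidth = ((Br + Bw) / 10^9) / time, in GB/s.
double effectiveBandwidthGBps(double bytesRead, double bytesWritten, double seconds)
{
    assert(seconds > 0.0);
    return ((bytesRead + bytesWritten) / 1e9) / seconds;
}

int main()
{
    const int n = 1 << 24;                     // 16 Mi floats (64 MiB per buffer)
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *dIn = nullptr, *dOut = nullptr;
    CHECK(cudaMalloc(&dIn, bytes));
    CHECK(cudaMalloc(&dOut, bytes));
    CHECK(cudaMemset(dIn, 0, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Untimed warm-up so one-time initialization is not counted.
    copyCoalesced<<<blocks, threads>>>(dIn, dOut, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    for (int variant = 0; variant < 2; ++variant) {
        CHECK(cudaEventRecord(start));
        if (variant == 0) copyCoalesced<<<blocks, threads>>>(dIn, dOut, n);
        else              copyStrided<<<blocks, threads>>>(dIn, dOut, n, 33);
        CHECK(cudaGetLastError());
        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));
        // Br = Bw = n * sizeof(float): every element is read once and written once.
        double gbps = effectiveBandwidthGBps(bytes, bytes, ms / 1e3);
        std::printf("%-12s %8.3f ms  %7.1f GB/s\n",
                    variant == 0 ? "coalesced" : "stride 33", ms, gbps);
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(dIn));
    CHECK(cudaFree(dOut));
    return 0;
}
```

    Comparing the reported figure against the theoretical peak for the same card shows how close a kernel gets to the hardware limit.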

    More information is available in the CUDA Best Practices Guide.