
Shared memory bandwidth Fermi vs Kepler GPU


Does Kepler have two times or four times the shared memory bandwidth of Fermi?

The Programming Guide states:

Each bank has a bandwidth of 32 bits per two clock cycles

for 2.X, and

Each bank has a bandwidth of 64 bits per clock cycle

for 3.X. Does this imply four times higher bandwidth?


Solution

  • On Fermi, each SM has 32 banks, each delivering 32 bits every two clock cycles.

    On Kepler, each SMX has 32 banks, each delivering 64 bits every clock cycle. However, Kepler's SMX was fundamentally redesigned for energy efficiency, and because fast clocks draw a lot of power, Kepler runs from a much slower core clock. See the Inside Kepler talk from GTC, about 8 minutes in, for more information.

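    So per clock cycle Kepler delivers 4x as much as Fermi, but the slower clock cancels much of that gain: in absolute terms Kepler has ~2x the shared memory bandwidth, not 4x.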

    The next version of the documentation (CUDA 5.0) should explain this better.
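
The arithmetic behind that answer can be sketched in a few lines of Python. The bank counts and widths are the figures quoted above; the relative clock values are an assumption made purely for illustration (Kepler's clock taken as half of Fermi's), not measured frequencies of any particular card, and the function names are hypothetical.

```python
def bytes_per_clock(banks, bits_per_bank, clocks_per_access):
    """Peak shared memory throughput of one multiprocessor, in bytes per clock."""
    return banks * bits_per_bank / 8 / clocks_per_access


def relative_bandwidth(bytes_per_clk, relative_clock):
    """Throughput in arbitrary units once the clock rate is taken into account."""
    return bytes_per_clk * relative_clock


# Figures from the Programming Guide quotes above.
fermi_per_clock = bytes_per_clock(banks=32, bits_per_bank=32, clocks_per_access=2)
kepler_per_clock = bytes_per_clock(banks=32, bits_per_bank=64, clocks_per_access=1)

# ASSUMPTION: illustrative relative clocks only, with Kepler at half of Fermi's.
FERMI_RELATIVE_CLOCK = 1.0
KEPLER_RELATIVE_CLOCK = 0.5

per_clock_ratio = kepler_per_clock / fermi_per_clock
absolute_ratio = (
    relative_bandwidth(kepler_per_clock, KEPLER_RELATIVE_CLOCK)
    / relative_bandwidth(fermi_per_clock, FERMI_RELATIVE_CLOCK)
)

print(f"Fermi:  {fermi_per_clock:.0f} bytes per clock per SM")
print(f"Kepler: {kepler_per_clock:.0f} bytes per clock per SMX")
print(f"Per-clock ratio: {per_clock_ratio:.1f}x")
print(f"Ratio with the assumed clocks: {absolute_ratio:.1f}x")
```

To estimate the ratio for two specific cards, replace the two relative clock constants with the actual clock frequencies that drive shared memory on each card.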