Shared memory bandwidth: Fermi vs. Kepler GPUs

Does Kepler have two times or four times the shared memory bandwidth of Fermi?

The Programming Guide states, for compute capability 2.x:

"Each bank has a bandwidth of 32 bits per two clock cycles"

and for compute capability 3.x:

"Each bank has a bandwidth of 64 bits per clock cycle"

Does this imply four times higher bandwidth?

Solution

On Fermi, each SM has 32 banks, each delivering 32 bits every two clock cycles.

On Kepler, each SMX has 32 banks, each delivering 64 bits every clock cycle. However, since Kepler's SMX was fundamentally redesigned to be energy efficient, and since running fast clocks draws a lot of power, Kepler operates at a much lower core clock. Check out the Inside Kepler talk from GTC, about 8 minutes in, for more information.

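So the answer to the question is that, although each bank delivers 4x as many bits per clock cycle, the lower clock means Kepler's effective shared memory bandwidth is ~2x that of Fermi, not 4x.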
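The arithmetic behind this can be sketched as follows. This is only an illustration: the bank counts and per-bank figures come from the Programming Guide quotes above, but `clock_ratio = 0.5` is an assumption standing in for "a much lower core clock" (real clock frequencies vary by card), and the function name is made up for this example.

```python
def bytes_per_clock_per_sm(banks, bits_per_bank, clocks_per_access):
    """Aggregate shared memory bytes delivered per clock cycle by one SM/SMX."""
    return banks * bits_per_bank / 8 / clocks_per_access

# Figures quoted from the Programming Guide.
fermi = bytes_per_clock_per_sm(banks=32, bits_per_bank=32, clocks_per_access=2)
kepler = bytes_per_clock_per_sm(banks=32, bits_per_bank=64, clocks_per_access=1)

# Per clock cycle, Kepler delivers four times as many bytes.
per_clock_ratio = kepler / fermi

# ASSUMPTION: Kepler's clock is about half the clock Fermi's figure refers to.
# This is an illustrative value, not a measured frequency.
clock_ratio = 0.5

# Bandwidth per unit of time scales with bytes per clock times clock frequency.
effective_ratio = per_clock_ratio * clock_ratio

print(f"Fermi:  {fermi} bytes/clock per SM")
print(f"Kepler: {kepler} bytes/clock per SMX")
print(f"Per-clock ratio: {per_clock_ratio}x, effective ratio: {effective_ratio}x")
```

With a different clock ratio for a specific pair of cards, the effective figure would shift accordingly, which is why the answer is stated as approximately 2x.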

The next version of the documents (CUDA 5.0) should explain this better.
