I am having some trouble understanding shared memory organization in CUDA, and I have three dumb questions.
The CUDA C Programming Guide says:
There is an L1 cache for each multiprocessor and an L2 cache shared by all multiprocessors, both of which are used to cache accesses to local or global memory, including temporary register spills.
Further on, they call the L2 cache "shared memory".
Am I wrong in saying that the L2 cache shared by all SMs and the shared memory used to store variables declared with the __shared__ qualifier are two different things?
I have a GeForce 630M with Compute Capability 2.1, so the result of the following:
cudaDeviceProp info;
cudaGetDeviceProperties(&info, 0);
printf("%zu kB\n", info.sharedMemPerBlock / 1024);  // sharedMemPerBlock is a size_t, so %zu rather than %d
is 48 kB.
Does this mean that 48 kB is both the total size of shared memory and the maximum possible shared memory for one block? For example, if I launch a kernel using N blocks, can every block use only 48/N kB?
Is there any difference between
extern __shared__ float shared[];
and
__shared__ float shared[];
?
You are correct: the L2 cache and shared memory are two different things. The L2 is "shared" amongst all SMs in the sense that it is a single, device-wide resource; I don't see the programming guide calling it "shared memory". L1/shared memory, by contrast, is a per-SM resource. The Fermi whitepaper provides additional description of the relationship between these resources.
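As a quick illustration, the runtime reports the two resources separately (a minimal sketch, assuming device 0):

#include <cstdio>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("L2 cache (device-wide): %d bytes\n", p.l2CacheSize);          // one L2 for the whole GPU
    printf("shared memory per block: %zu bytes\n", p.sharedMemPerBlock);  // __shared__ space available to a threadblock
    return 0;
}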
48KB is the maximum usable by a single threadblock, and it is also the maximum available per SM on this device. It is physically (not logically) shared by all threadblocks currently resident on the SM. If a single threadblock uses, say, 14KB of shared memory, then (considering shared memory only here) up to 3 threadblocks may be resident on the SM at any given time. Those 3 threadblocks share the physical resource (48KB total, 14KB per threadblock), but each has its own logical section (i.e., a threadblock cannot see the shared memory belonging to another threadblock). If a single threadblock used 40KB of shared memory, only one such threadblock could be resident on the SM at any given time. So no, each block is not limited to 48/N kB; each block can use up to 48KB, and shared memory usage instead limits how many blocks can be resident simultaneously.
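You can see this numerically by asking the runtime how many blocks of a given kernel fit per SM (a minimal sketch, assuming CUDA 6.5 or newer for the occupancy API; the kernel and the 14KB figure are just for illustration):

#include <cstdio>

__global__ void kernel14k()
{
    __shared__ float buf[14 * 1024 / sizeof(float)];  // 14KB of static shared memory
    buf[threadIdx.x] = (float)threadIdx.x;            // touch it so it is not optimized away
}

int main()
{
    int numBlocks = 0;
    // How many blocks of 256 threads can be resident per SM, given all resource limits?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, kernel14k, 256, 0);
    printf("resident blocks per SM: %d\n", numBlocks);  // at most 3 on a 48KB-per-SM device, shared memory permitting
    return 0;
}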
The difference is that the first case is dynamically allocated, and requires passing a shared memory size (allocation) as part of the kernel launch; the second case is statically allocated, and requires that you specify the size:
__shared__ float shared[32];  // a size (here, 32) is required for a static allocation
You may want to refer to this recent question/answer for more discussion of the difference between static and dynamic __shared__ allocation, and the kernel launch parameter needed in the dynamic case.
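A minimal side-by-side sketch of the two forms (the kernel names are my own):

__global__ void staticKernel()
{
    __shared__ float shared[32];       // size fixed at compile time
    shared[threadIdx.x] = (float)threadIdx.x;
}

__global__ void dynamicKernel()
{
    extern __shared__ float shared[];  // size supplied at launch time
    shared[threadIdx.x] = (float)threadIdx.x;
}

int main()
{
    staticKernel<<<1, 32>>>();                       // no size argument needed
    dynamicKernel<<<1, 32, 32 * sizeof(float)>>>();  // third launch parameter = bytes of dynamic shared memory
    cudaDeviceSynchronize();
    return 0;
}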
Newer GPUs provide, in some cases, methods to exceed the 48KB limit, either per SM (physically) or per threadblock (logically).
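For example, on newer architectures the dynamic case can opt in to more than 48KB per block (a minimal sketch, assuming a Volta-or-newer GPU and CUDA 9+; the kernel name and the 64KB figure are just for illustration):

__global__ void bigSharedKernel()
{
    extern __shared__ float s[];
    s[threadIdx.x] = (float)threadIdx.x;
}

int main()
{
    int bytes = 64 * 1024;  // 64KB, beyond the default 48KB per-block limit
    // Opt this kernel in to the larger dynamic shared memory carveout
    cudaFuncSetAttribute(bigSharedKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes);
    bigSharedKernel<<<1, 256, bytes>>>();
    cudaDeviceSynchronize();
    return 0;
}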