__shared__ float smem[2];
smem[0] = global_memory[0];
smem[1] = global_memory[1];
/*process smem[0]...*/
/*process smem[1]...*/
My question is: does smem[1] = global_memory[1]; block computation on smem[0]?
In Cuda thread scheduling - latency hiding and Cuda global memory load and store, they say memory reads will not stall the thread until the read data is actually used. Does storing it to shared memory count as "using the data"? Should I do something like this instead:
__shared__ float smem[2];
float a = global_memory[0];
float b = global_memory[1];
smem[0] = a;
/* process smem[0]*/
smem[1] = b;
/* process smem[1]*/
Or does the compiler do this for me? But then, does it use extra registers?
Yes, in the general case this would block the CUDA thread:
smem[0] = global_memory[0];
The reason is that this operation is broken into two steps:
LDG Rx, [Ry]
STS [Rz], Rx
The first SASS instruction loads from global memory. This operation does not block the CUDA thread: it can be issued to the LD/ST unit, and the thread can continue. However, the register target of that operation (Rx) is tracked, and if any instruction needs to use the value in Rx, the CUDA thread will stall at that point. Of course, the very next instruction is the STS (store shared) instruction that uses the value in Rx, so the CUDA thread will stall there (until the global load is satisfied).
Of course, it's possible that the compiler may reorder the instructions so that the STS instruction occurs later, but there is no guarantee of that. Regardless of where the compiler orders the STS instruction, the CUDA thread will stall at that point until the global load is completed. For the example you have given, I think it's quite likely that the compiler would create code that looks like this:
LDG Rx, [Ry]
LDG Rw, [Ry+1]
STS [Rz], Rx
STS [Rz+1], Rw
In other words, I think it's likely that the compiler would organize these loads such that both global loads are issued before a possible stall occurs. However, there is no guarantee of this, and the specific behavior for your code can only be determined by studying the actual SASS (e.g. with cuobjdump -sass); in the general case we should assume the possibility of a thread stall.
Yes, if you can break up the loads and stores as you have shown in your code, then this operation:
float b = global_memory[1];
should not block this operation:
smem[0] = a;
/* process smem[0]*/
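For completeness, here is that manually split version written as a compilable kernel sketch (the kernel name, the out parameter, and the final write are illustrative placeholders, not part of your code):

__global__ void split_loads(const float *global_memory, float *out)
{
    __shared__ float smem[2];
    float a = global_memory[0];   // both global loads issued up front;
    float b = global_memory[1];   // neither stalls the thread yet
    smem[0] = a;                  // stalls only until the first load returns
    /* process smem[0] */
    smem[1] = b;                  // stalls only until the second load returns
    /* process smem[1] */
    out[0] = smem[0] + smem[1];   // placeholder use of the processed data
}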
Having said all that, CUDA 11 introduced a new mechanism to address this scenario, supported by devices of compute capability 8.0 and higher (so, all Ampere GPUs at this time). This feature is referred to as asynchronous copy of data from global to shared memory. It allows these copy operations to proceed without stalling CUDA threads. However, this feature requires proper use of a barrier to make sure that when you actually need to use the data in shared memory, it is present.
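As a minimal sketch of that mechanism (the kernel name and the processing steps are illustrative), using the cooperative groups memcpy_async API from CUDA 11:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

__global__ void async_copy(const float *global_memory)
{
    __shared__ float smem[2];
    auto block = cg::this_thread_block();
    // Start the global->shared copy; no CUDA thread is stalled here.
    cg::memcpy_async(block, smem, global_memory, sizeof(float) * 2);
    /* ...independent work can overlap with the copy here... */
    cg::wait(block);   // barrier: smem is guaranteed to be populated after this
    /* process smem[0] */
    /* process smem[1] */
}

On devices below compute capability 8.0 the API falls back to an ordinary (blocking) copy, so the code remains functionally correct there; the non-stalling behavior requires compiling for and running on cc 8.0+.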