In the CUDA Programming Guide in the section about Cooperative Groups, there is an example of grid-local synchronization:
grid_group grid = this_grid();
grid.sync();
Unfortunately, I couldn't find a precise definition of the behavior of grid.sync(). Is it correct to take the following definition given for __syncthreads and extend it to the grid level?
void __syncthreads();
waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.
So, my question: is this correct?
this_grid().sync();
waits until all threads in the grid have reached this point and all global and shared memory accesses made by these threads prior to this_grid().sync() are visible to all threads in the grid.
I doubt this is correct because, in the CUDA Programming Guide, a couple of lines below grid.sync(); there is the following statement:
To guarantee the co-residency of the thread blocks on the GPU, the number of blocks launched needs to be carefully considered.
Does it mean that if I use so many threads that there is no co-residency of thread blocks, I can end up in a situation where threads deadlock?
The same question arises when I try to use coalesced_threads().sync(). Is the following correct?
coalesced_threads().sync();
waits until all active threads in the warp have reached this point and all global and shared memory accesses made by these threads prior to coalesced_threads().sync() are visible to all threads in the warp's list of active threads.
Does the following example exit from the while loop?
auto ct = coalesced_threads();
assert(ct.size() == 2);
b = 0; // shared between all threads
if (ct.thread_rank() == 0)
    while (b == 0) {
        // what if only rank 0 thread is always taken due to thread divergence?
        ct.sync(); // does it guarantee that rank 0 will wait for rank 1?
    }
if (ct.thread_rank() == 1)
    while (b == 0) {
        // what if a thread with rank 1 never executed?
        b = 1;
        ct.sync(); // does it guarantee that rank 0 will wait for rank 1?
    }
To make the example above clear: without ct.sync(), it is unsafe and can deadlock (loop infinitely):
auto ct = coalesced_threads();
assert(ct.size() == 2);
b = 0; // shared between all threads
if (ct.thread_rank() == 0)
    while (b == 0) {
        // what if only rank 0 thread is always taken due to thread divergence?
    }
if (ct.thread_rank() == 1)
    while (b == 0) {
        // what if a thread with rank 1 never executed?
        b = 1;
    }
So, my question: is this correct?
this_grid().sync();
waits until all threads in the grid have reached this point and all global and shared memory accesses made by these threads prior to this_grid().sync() are visible to all threads in the grid.
Yes, that is correct, assuming you have a proper cooperative launch. A proper cooperative launch implies a number of things:
cudaGetLastError() returns cudaSuccess
Does it mean that if I use so many threads that there is no co-residency of thread blocks
If you violate the requirements for a cooperative launch, you are exploring undefined behavior. There is no point trying to definitively answer such questions, except to say that the behavior is undefined.
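To make "proper cooperative launch" a bit more concrete, here is a minimal sketch of what I would expect such a launch to look like. The kernel two_phase and the host helper run_two_phase are hypothetical names invented for this illustration, and the block size of 256 is an arbitrary assumption. The sketch checks the device attribute for cooperative launch support, sizes the grid from the occupancy calculator so that all blocks can be co-resident, launches with cudaLaunchCooperativeKernel, and then checks cudaGetLastError(). Depending on your toolkit version you may also need to compile with -rdc=true.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical two-phase kernel: every thread writes, the whole grid
// synchronizes, then every thread reads an element written by another thread.
__global__ void two_phase(int *data, int *out, int n)
{
    cg::grid_group grid = cg::this_grid();
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    for (int i = tid; i < n; i += stride) data[i] = i;                 // phase 1
    grid.sync();                                                       // grid-wide barrier
    for (int i = tid; i < n; i += stride) out[i] = data[(i + 1) % n];  // phase 2
}

// Returns 0 on success, 1 if there is no device with cooperative launch
// support, 2 on a CUDA error or a wrong result.
int run_two_phase(int n)
{
    int dev = 0, supported = 0, num_sms = 0, blocks_per_sm = 0;
    if (cudaGetDevice(&dev) != cudaSuccess) return 1;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, dev);
    if (!supported) return 1;

    // Assumption: 256 threads per block, no dynamic shared memory.
    const int threads = 256;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, two_phase, threads, 0);
    int blocks = num_sms * blocks_per_sm;  // largest grid that can be co-resident
    if (blocks == 0) return 2;

    int *data = nullptr, *out = nullptr;
    if (cudaMalloc(&data, n * sizeof(int)) != cudaSuccess) return 2;
    if (cudaMalloc(&out, n * sizeof(int)) != cudaSuccess) { cudaFree(data); return 2; }

    void *args[] = { &data, &out, &n };
    cudaLaunchCooperativeKernel((void *)two_phase, dim3(blocks), dim3(threads), args, 0, 0);
    cudaError_t err = cudaGetLastError();  // must be cudaSuccess for a proper launch
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    std::vector<int> host(n);
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(data);
    cudaFree(out);
    if (err != cudaSuccess) return 2;

    for (int i = 0; i < n; ++i)
        if (host[i] != (i + 1) % n) return 2;
    return 0;
}
```

The grid-stride loops are there so the grid size can be capped at the co-resident maximum regardless of n. If you request more blocks than can be co-resident, I would expect the launch itself to report an error, which is why the check on cudaGetLastError() matters.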
Regarding your statement(s) about coalesced threads, they are correct, although the wording must be understood carefully. The active threads for a particular instruction are the same as the coalesced threads.
In your example, you are creating an illegal case:
auto ct = coalesced_threads();
assert(ct.size() == 2); //there are exactly 2 threads in group ct
b = 0; // shared between all threads
if (ct.thread_rank() == 0) // this means that only thread whose rank is zero can participate in the next instruction - by definition you have excluded 1 thread
    while (b == 0) {
        // what if only rank 0 thread is always taken due to thread divergence?
        // it is illegal to request a synchronization of a group of threads when your conditional code prevents one or more threads in the group from participating
        ct.sync(); // does it guarantee that rank 0 will wait for rank 1?
    }
Two different .sync() statements, in different places in the code, cannot satisfy the requirements of a single sync barrier. They each represent an individual barrier, whose requirements must be properly met.
Due to the illegal coding, this example also has undefined behavior; the same comments apply.
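For contrast, here is a sketch of one way the same hand-off could be written legally: every member of the group reaches the same ct.sync() call, and only the work around the barrier is conditional on rank. The kernel handoff and the helper run_handoff are hypothetical names for this illustration. It assumes that threads 0 and 1 of the warp are converged when coalesced_threads() is called, so that they end up in the same group; that is an assumption of the sketch, not something the code enforces.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: rank 1 publishes a value, rank 0 consumes it.
__global__ void handoff(int *result)
{
    __shared__ int b;
    if (threadIdx.x < 2) {  // only two threads of the warp take this branch
        cg::coalesced_group ct = cg::coalesced_threads();

        if (ct.thread_rank() == 0) b = 0;
        ct.sync();  // barrier 1: reached by every member of ct

        if (ct.thread_rank() == 1) b = 1;
        ct.sync();  // barrier 2: reached by every member of ct

        if (ct.thread_rank() == 0) *result = b;  // reads after barrier 2
    }
}

// Returns the value observed by rank 0, or -1 if allocation failed or
// the kernel did not run.
int run_handoff()
{
    int *d_result = nullptr, h_result = -1;
    if (cudaMalloc(&d_result, sizeof(int)) != cudaSuccess) return -1;
    cudaMemcpy(d_result, &h_result, sizeof(int), cudaMemcpyHostToDevice);
    handoff<<<1, 32>>>(d_result);
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return h_result;
}
```

The point of the design is that there is no spin loop at all: the barrier provides the ordering, so rank 0 does not need to poll b.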