Suppose I have two tasks to run on a GPU, the second of which relies on essentially all of the work done by the first. Traditionally, I would have to write these tasks as two separate kernels and schedule the second to run at some point after the first. But with CUDA 9, I can now synchronize on the entire grid concluding its work on the first task, using the cooperative groups feature, and then proceed to have the grid do its second-task work.
My questions are:
Making this a CW answer so others feel free to add their opinions and edit.
The grid-wide sync feature in cooperative groups carries with it a requirement to limit the grid size to the number of blocks that can be simultaneously resident on the GPU you are running on. This isn't a major performance limiter, but it requires you to write code that can flexibly use different grid sizes while still achieving maximum performance. Grid-stride loops are a typical component of such a coding strategy.
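As an illustration (not taken from the question), a fused two-task kernel combining grid-stride loops with a grid-wide sync might look like the sketch below. The kernel name `fused_tasks`, the arrays `a` and `b`, and the per-element work are all hypothetical placeholders:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical fused kernel: task 1 writes `a`, the whole grid
// synchronizes, then task 2 reads `a`. The grid-stride loops let the
// same code run correctly for any grid size the launch chooses.
__global__ void fused_tasks(float *a, float *b, int n)
{
    cg::grid_group grid = cg::this_grid();
    int stride = gridDim.x * blockDim.x;

    // Task 1: each thread covers multiple elements via a grid-stride loop.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        a[i] = a[i] * 2.0f;   // stand-in for the real first-task work

    grid.sync();  // all of task 1's writes are now visible grid-wide

    // Task 2: depends on essentially all of task 1's output.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        b[i] = a[i] + a[n - 1 - i];
}
```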
Therefore the grid-wide sync feature will often require careful coding and additional code overhead (e.g. use of the occupancy API) to achieve maximum performance, especially compared to simple or naive kernels.
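To make the "occupancy API" point concrete, here is a hedged sketch of the host-side launch code: the grid is sized so that every block can be co-resident, and the kernel is launched with `cudaLaunchCooperativeKernel`. The kernel body and all names here are illustrative only:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Minimal cooperative kernel so the launch code below is self-contained.
__global__ void coop_kernel(float *data, int n)
{
    cg::grid_group grid = cg::this_grid();
    /* ... phase 1 ... */
    grid.sync();
    /* ... phase 2 ... */
}

// Host-side launch: size the grid with the occupancy API so every block
// can be simultaneously resident, then launch cooperatively.
void launch(float *d_data, int n)
{
    int dev = 0, numSms = 0, blocksPerSm = 0, supported = 0;
    const int blockSize = 256;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, dev);
    // (a real program would fall back to separate kernels if !supported)
    cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, coop_kernel,
                                                  blockSize, 0 /*dyn. smem*/);
    void *args[] = { &d_data, &n };
    cudaLaunchCooperativeKernel((void *)coop_kernel,
                                dim3(numSms * blocksPerSm), dim3(blockSize),
                                args, 0 /*smem*/, 0 /*stream*/);
}
```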
To offset this reduction in programmer productivity, some possible benefits are:
In situations where launch overhead is a significant portion of the overall runtime, the cooperative grid-wide sync may afford significant benefit. In addition to the fusion of 2 separate kernels, algorithms that call kernels in a loop, for example Jacobi iteration/relaxation or other timestep simulation algorithms, may benefit noticeably, since the launch loop can effectively be "moved onto the GPU", replacing a loop of kernel launches with a single kernel launch.
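A sketch of "moving the launch loop onto the GPU", using a toy 1D Jacobi-style sweep: instead of launching a relaxation kernel `num_iters` times from the host, one cooperative kernel runs the iteration loop internally, with `grid.sync()` standing in for the per-iteration kernel-launch boundary. The kernel name and the stencil are illustrative, not from any particular codebase:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// One cooperative kernel replaces num_iters host-side kernel launches.
__global__ void jacobi_fused(float *cur, float *next, int n, int num_iters)
{
    cg::grid_group grid = cg::this_grid();
    int stride = gridDim.x * blockDim.x;

    for (int it = 0; it < num_iters; ++it) {
        // One relaxation sweep, grid-stride over interior points.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            if (i > 0 && i < n - 1)
                next[i] = 0.5f * (cur[i - 1] + cur[i + 1]);

        grid.sync();  // every thread finished this sweep; safe to swap

        float *t = cur; cur = next; next = t;  // per-thread pointer swap
    }
}
```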
In situations where there is a significant amount of on-chip "state" (e.g. register contents, shared memory contents) that must be loaded before the grid-wide sync and will be used after it, cooperative groups may be a significant win, saving the time that the kernel following a grid-wide sync would otherwise have spent re-loading that state. This appears to have been the motivation here (see section 4.3), for example. I'm not suggesting they were using cooperative groups (they were not). I'm suggesting they were motivated to seek a grid-wide sync, using makeshift methods available at the time, to eliminate both the cost of state reload and possibly the cost of kernel launch overhead.
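The on-chip-state point can be sketched as follows. With two separate kernels, the shared-memory tile and the per-thread register `acc` below would be lost at the kernel boundary and have to be reloaded from global memory; inside one cooperative kernel they survive the `grid.sync()`. All names and the toy computation are hypothetical:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void keep_state(const float *in, float *partial, float *out, int n)
{
    __shared__ float tile[256];  // shared-memory state, loaded in phase 1
    cg::grid_group grid = cg::this_grid();
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    float acc = (gid < n) ? in[gid] : 0.0f;  // register state, phase 1
    tile[threadIdx.x] = acc;
    __syncthreads();

    if (threadIdx.x == 0) {                  // toy per-block result
        float s = 0.0f;
        for (int i = 0; i < blockDim.x; ++i) s += tile[i];
        partial[blockIdx.x] = s;
    }

    grid.sync();  // all per-block results visible; tile and acc persist

    // Phase 2 reuses acc without re-reading it from global memory.
    if (gid < n) out[gid] = acc + partial[0];
}
```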