
Multi-GPU kernel launch


I am wondering if there are any advantages/drawbacks of launching a kernel on multiple GPUs via cudaLaunchCooperativeKernelMultiDevice when no actual cooperation is happening, versus the traditional loop:

for (int id = 0; id < numDevices; ++id)
{
    cudaSetDevice(id);
    kernel<<<grid, block, 0, stream[id]>>>( ... );
}

cudaLaunchCooperativeKernelMultiDevice is definitely less code than a loop...
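For comparison, here is a rough sketch of what the single-call launch looks like. The `cudaLaunchParams` fields are the real members of the struct in the CUDA runtime API; `numDevices`, `grid`, `block`, `stream`, and `kernelArgs` are assumed to be set up elsewhere:

```cuda
#include <cuda_runtime.h>

__global__ void kernel(float *data) { /* ... */ }

// Sketch: one cudaLaunchParams entry per device.
// Assumes numDevices, grid, block, stream[], and kernelArgs[][] exist.
void launchOnAllDevices(int numDevices, dim3 grid, dim3 block,
                        cudaStream_t *stream, void **kernelArgs[])
{
    cudaLaunchParams params[8];  // assumed upper bound on device count
    for (int id = 0; id < numDevices; ++id) {
        params[id].func      = (void *)kernel;
        params[id].gridDim   = grid;
        params[id].blockDim  = block;
        params[id].args      = kernelArgs[id]; // void** of argument pointers
        params[id].sharedMem = 0;
        params[id].stream    = stream[id];     // must be non-default streams
    }
    cudaLaunchCooperativeKernelMultiDevice(params, numDevices);
}
```

Note that each entry needs its own non-default stream, so once the `cudaLaunchParams` setup is counted, the call is not dramatically shorter than the loop.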


Solution

  • One possible drawback is that the multi grid cooperative launch mechanism is not supported on all multi-GPU systems, whereas the launch-in-a-loop method is.

    So by using cudaLaunchCooperativeKernelMultiDevice you restrict your code to systems whose devices report the cudaDevAttrCooperativeMultiDeviceLaunch attribute, shrinking the footprint of systems on which it can run correctly.

    Amongst the various limitations implied by this are not being able to run on systems with GPUs in WDDM mode, and not being able to run on systems where the GPUs are not all identical in terms of compute capability. You can read some of the other restrictions in the programming guide.
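Since the restriction hinges on that attribute, a minimal sketch of checking it at runtime before choosing the launch path (using only the standard `cudaDeviceGetAttribute` call):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);

    for (int id = 0; id < numDevices; ++id) {
        int supported = 0;
        // Query whether this device supports multi-device cooperative launch
        cudaDeviceGetAttribute(&supported,
                               cudaDevAttrCooperativeMultiDeviceLaunch, id);
        printf("device %d: multi-device cooperative launch %s\n",
               id, supported ? "supported" : "NOT supported");
    }
    return 0;
}
```

If any device reports 0, falling back to the launch-in-a-loop method keeps the code portable.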