parallel-processing, cuda, multi-gpu

CUDA: do I need different streams on multiple GPUs to execute in parallel?


I want to run kernels on multiple GPUs in parallel. To do this, I switch between devices with cudaSetDevice() and then launch my kernel on the selected device. Normally, all calls in one stream execute sequentially, and different streams are needed for them to execute in parallel. Does this also hold across different devices, or can I launch my kernels on the default stream of each device and still have them run in parallel?


Solution

  • You don't need non-default streams on each device to get kernels executing concurrently on multiple devices from the same host process or thread. Each device has its own default stream, and kernel launches are asynchronous and do not block the host, so a tight loop that launches kernels on separate devices should produce overlapping execution for non-trivial kernels (remember that device context switching has latency).

    You do, however, need the asynchronous versions of the other host API calls you would typically use alongside a kernel in the default stream, because many of the synchronous ones (the cudaMemcpy family, for example) block the host thread.
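The pattern described above can be sketched roughly as follows. This is a minimal illustration, not code from the original answer: the kernel busyKernel and the helper runOnAllDevices are hypothetical names, and the sketch assumes pinned host buffers (allocated with cudaMallocHost), since cudaMemcpyAsync on pageable memory may still behave synchronously with respect to the host. Everything is issued to each device's default stream (stream 0).

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1.0f to each element `iters` times so that
// each launch runs long enough for overlap between devices to matter.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k)
            x += 1.0f;
        data[i] = x;
    }
}

// Hypothetical helper: runs busyKernel on every visible device using only
// the default stream of each device. On success, firstElems holds element 0
// of each device's result. Returns false if any CUDA call failed.
bool runOnAllDevices(int n, int iters, std::vector<float> &firstElems)
{
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess) return false;

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float *> dev(deviceCount, nullptr);
    std::vector<float *> host(deviceCount, nullptr);
    bool ok = true;

    // Allocate a device buffer and a pinned host buffer per device.
    for (int d = 0; d < deviceCount && ok; ++d) {
        ok = ok && cudaSetDevice(d) == cudaSuccess;
        ok = ok && cudaMalloc(&dev[d], bytes) == cudaSuccess;
        ok = ok && cudaMallocHost(&host[d], bytes) == cudaSuccess;
        if (ok)
            for (int i = 0; i < n; ++i) host[d][i] = 0.0f;
    }

    // Enqueue copy-in, kernel, and copy-out on each device's default stream.
    // None of these calls waits for the device, so the loop moves on to the
    // next device while earlier devices are still working.
    for (int d = 0; d < deviceCount && ok; ++d) {
        ok = ok && cudaSetDevice(d) == cudaSuccess;
        ok = ok && cudaMemcpyAsync(dev[d], host[d], bytes,
                                   cudaMemcpyHostToDevice, 0) == cudaSuccess;
        if (ok) busyKernel<<<(n + 255) / 256, 256>>>(dev[d], n, iters);
        ok = ok && cudaGetLastError() == cudaSuccess;
        ok = ok && cudaMemcpyAsync(host[d], dev[d], bytes,
                                   cudaMemcpyDeviceToHost, 0) == cudaSuccess;
    }

    // Only now block the host, once per device, until its work has finished.
    for (int d = 0; d < deviceCount && ok; ++d) {
        ok = ok && cudaSetDevice(d) == cudaSuccess;
        ok = ok && cudaDeviceSynchronize() == cudaSuccess;
    }

    // Collect one value per device, then release all buffers.
    firstElems.clear();
    for (int d = 0; d < deviceCount; ++d) {
        if (ok) firstElems.push_back(host[d][0]);
        cudaSetDevice(d);
        if (dev[d]) cudaFree(dev[d]);
        if (host[d]) cudaFreeHost(host[d]);
    }
    return ok;
}
```

A caller would invoke something like runOnAllDevices(1 << 20, 1000, results) from main(). Replacing the cudaMemcpyAsync calls with plain cudaMemcpy would make the enqueue loop wait on each device in turn, which is the serialization the answer warns about.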