When compiling a CUDA program that launches kernels on multiple devices, does nvcc internally compile a version of the kernel for each device?
I ask because I am using PyCUDA and am struggling to understand why I have to compile the kernel code (i.e. call SourceModule) separately for each device I want to launch the kernel on.
Thanks for your help!
The one-word answer is no. The compiler doesn't know, or need to know, anything about the number of GPUs at compile time. The runtime API will automagically load code from the binary payload embedded in your executable into each context, without the compiler or the programmer needing to do anything. If your code requires JIT recompilation (for example, because the binary only carries PTX for the GPU it ends up running on), the driver will compile once and reuse the cached machine code in subsequent contexts, provided the hardware targets are the same.
In PyCUDA you are using the driver API, so context management is more manual. You have to load modules into the context of each GPU you are using. If you are using the SourceModule feature, that means you need to submit the code once for each GPU. But (IIRC) PyCUDA also caches the code it compiles with nvcc, so even though you need to call SourceModule in every context, you shouldn't get a compiler invocation every time if the GPUs are the same. If this bothers you and you are not doing a lot of metaprogramming, consider switching to pre-compiled cubins. You still need to load them into every context, but there is no compilation overhead at runtime.
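To make the per-context loading concrete, here is a minimal sketch of what that loop might look like. The kernel `scale`, the helper `grid_size`, and the function `launch_on_every_device` are made-up names for illustration, not anything from your code, and I have not tuned or benchmarked this. It assumes every device can run the same kernel, and that any cubin you pass in was built for the architecture of the devices you load it on.

```python
# Hypothetical example kernel; SourceModule wraps the source in extern "C"
# by default, so get_function("scale") can find it by its plain name.
KERNEL_SOURCE = r"""
__global__ void scale(float *x, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}
"""


def grid_size(n, block):
    """Number of blocks needed to cover n elements (ceiling division)."""
    return (n + block - 1) // block


def launch_on_every_device(n=1024, factor=2.0, cubin=None, block=256):
    """Load the module into one context per GPU and launch the kernel there.

    If `cubin` (bytes) is given it is loaded as-is; otherwise SourceModule
    is called in each context and PyCUDA's compiler cache is relied on to
    avoid repeated nvcc runs for identical GPUs.
    """
    # Imported lazily so this file can be imported on a machine without a GPU.
    import numpy as np
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    cuda.init()
    outputs = []
    for ordinal in range(cuda.Device.count()):
        # make_context() creates a context on this device and makes it current.
        ctx = cuda.Device(ordinal).make_context()
        try:
            # Modules belong to a context, so this has to happen per device.
            if cubin is None:
                module = SourceModule(KERNEL_SOURCE)
            else:
                module = cuda.module_from_buffer(cubin)
            scale = module.get_function("scale")

            x = np.ones(n, dtype=np.float32)
            scale(cuda.InOut(x), np.float32(factor), np.int32(n),
                  block=(block, 1, 1), grid=(grid_size(n, block), 1))
            outputs.append(x)
        finally:
            # Take the context off the stack before moving to the next device.
            ctx.pop()
    return outputs
```

For the pre-compiled route, one option is to call `pycuda.compiler.compile(KERNEL_SOURCE)` once, which returns the cubin as bytes, and pass that in as `cubin`. If you build the cubin yourself with `nvcc -cubin`, you would need to add `extern "C"` to the kernel yourself so the name isn't mangled.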