For my CUDA development, I am using a machine with 16 CPU cores and one GTX 580 GPU, which has 16 SMs. I plan to launch 16 host threads (one per core), with each thread launching one kernel consisting of 1 block of 1024 threads. My goal is to run the 16 kernels in parallel on the 16 SMs. Is this possible/feasible?
I have tried to read as much as possible about independent contexts, but there does not seem to be much information available. As I understand it, each host thread can have its own GPU context, but I am not sure whether the kernels will run in parallel if I use independent contexts.
I could gather the data from all 16 host threads into one large structure and pass it to the GPU in a single kernel launch. However, that would involve too much copying and would slow down the application.
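For reference, the single-launch alternative might look roughly like the sketch below: one contiguous buffer in which each host thread owns one 1024-element slice, and one kernel launch with 16 blocks so that block `b` handles the slice from host thread `b`. The names `processChunks`, `kNumChunks`, and `kChunkSize`, the `float` element type, and the doubling operation are all placeholders I made up for illustration, not my real code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sizes: 16 slices (one per host thread), 1024 elements each.
constexpr int kNumChunks = 16;
constexpr int kChunkSize = 1024;

// Placeholder kernel: block b processes the slice contributed by host thread b.
__global__ void processChunks(float *data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= 2.0f;  // stand-in for the real per-element work
}

int main()
{
    // One contiguous host buffer; host thread i would fill
    // the range [i * kChunkSize, (i + 1) * kChunkSize).
    std::vector<float> host(kNumChunks * kChunkSize, 1.0f);
    const size_t bytes = host.size() * sizeof(float);

    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // A single host-to-device copy for all 16 slices.
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    // One launch: 16 blocks of 1024 threads instead of 16 separate kernels.
    processChunks<<<kNumChunks, kChunkSize>>>(dev);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(dev);
        return 1;
    }

    // A single device-to-host copy for all results.
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);

    std::printf("first element after kernel: %f\n", host[0]);
    return 0;
}
```

The extra cost I am worried about is the step where each host thread's data has to be gathered into the shared `host` buffer before the copy.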
While a multi-threaded application can hold multiple CUDA contexts simultaneously on the same GPU, those contexts cannot perform operations concurrently. When active, each context has sole use of the GPU, and must yield before another context (which could include operations with a rendering API or a display manager) can have access to the GPU.
So, in a word, no: this strategy can't work with any current CUDA version or hardware.