Search code examples
cudadynamic-parallelism

Dynamic parallelism - launching many small kernels is very slow


I am trying to use dynamic parallelism to improve an algorithm I have in CUDA. In my original CUDA solution, every thread computes a number that is common for each block. What I want to do is to first launch a coarse (or low resolution) kernel, where threads compute the common value just once (like if every thread represents one block). Then each thread creates a small grid of 1 block (16x16 threads), and launches a child kernel for it passing the common value. In theory it should be faster because one is saving many redundant operations. But in practice, the solution works very slow, I don't know why.

This is the code, very simplified, just the idea.

__global__ coarse_kernel( parameters ){
    int common_val = compute_common_val();
    dim3 dimblock(16, 16, 1);
    dim3 dimgrid(1, 1, 1);
    child_kernel <<< dimgrid, dimblock >>> (common_val, parameters);

}

__global__ child_kernel( int common_val, parameters ){
    // use common value
    do_computations(common_val, parameters);
}

The amount of child_kernels is a lot, one per thread and there must be around 400x400 threads. From what I understand, the GPU should process all these kernels in parallel, right?

Or child kernels are processed somehow sequentially?

My results show that performance is more than 10 times slower than in the original solution I had.


Solution

  • There is a cost in launching kernels, either parent or child. If your child kernels do not extract much parallelism and there is not much benefit against their non-parallel counterparts, then your faint benefit may be cancelled out by the child kernel launch overheads.

    In formulas, let to be the overhead to execute a child kernel, te its execution time and ts the time to execute the same code without the help of dynamic parallelism. The speedup arising from the use of dynamic parallelism is ts/(to+te). Perhaps (but this cannot be envinced from your code) te<ts but te,ts<<to, so that ts/(to+te) is about (ts/to)<1 and you observe a slowdown instead of a speedup.