I'm currently playing with the new CUDA Dynamic Parallelism (CDP) feature introduced in CUDA 5.0. I picked the N-queens puzzle as an example for tree search algorithms with high work-imbalance, which, in my opinion, may benefit from CDP.
My approach is roughly as follows: For a given board configuration (a chess board with a certain numbers of queens already placed in the first rows) I start a kernel with a number of threads. Each thread tries one possible branch of the sub-tree below the given configuration up to a given max. depth. If the leave of the branch still represents a valid configuration, that thread spawns a child grid of threads that search the sub-tree that is then based on that configuration. Threads that find their configuration to be invalid (two or more queens could attack each other) will terminate. If a thread successfully placed the last queen on the board it increments the solution counter.
Before launching the kernels I pre-calculate some board configurations on the CPU and then launch a grid for each of those configuration.
Now to the problem: I found my solution to be significantly slower then another CUDA implementation that does not use CDP. So I started the Nsight profiler to find the reason. Here is my first result (for N=10):
Obviously the GPU is not fully occupied. So I figured that I need to use different streams for launching the child grids in order to prevent them from waiting on each other. Here is the profiling result when using a new stream for each child grid launch:
This looks more dense (and is faster), but I still don't quite understand the reason for this pattern. Why are there so many gabs between some launches (especially in the end)?
But it gets even stranger. When I increment the N (and thus the workload) to 13, the pattern looks as follows:
Does someone know how CDP works internally? Are there any implicit synchronization barriers I did not consider yet? Or am I reading the profiler output wrong? I'm particularly curious what this one thread spanning over almost the entire execution time in the last case could be.
I also didn't find any documentation for the Nsight Visual Profiler concerning the output for CDP. Any good references about what Nsight is showing there would help as well.
Thanks!
I can answer your profiler questions. I'll reference the last image in your question.
The "Compute" row shows a summary of all kernel executions happening in a context (in your case you seem to have only one context on the device, so the compute row is therefore showing all activity on the device).
Sub-rows within the Compute row are used to show concurrent kernel executions (note that this is true for both CDP and non-CDP applications). These sub-rows are created dynamically as necessary depending on how much concurrency there is. The placement of kernels within certain sub-rows does not indicate anything... they are just packed in using a heuristic that tries to minimize the number of needed sub-rows. The height of the sub-rows is also scaled to keep vertical space reasonable. In your case you have a lot of concurrency at some points so it looks like you have 30+ sub-rows, which means that as some points you have 30+ kernels executing simultaneously on the device.
The goal of the Compute row is to give an overview of how busy the GPU is over time. To see specifically what kernels are executing you need to explore the kernel rows in the timeline.
The kernel timelines are shown just below the Context timeline. There is one row for each host-launched kernel. Like the Compute row, a kernel row can have sub-rows if needed to show concurrent execution of that kernel. In the case of CDP applications, a kernel row can also have child kernel rows that represent the kernels launched from that kernel.
For your example I see a single host-launched kernel "nQue...". Note that there are 7 sub-rows for this kernel since at some points there are 7 instances of the kernel executing at the same time. For kernels that don't launch child kernels, the kernel execution is represented by a solid interval, showing the time that that instance of the kernel was executing on the GPU. For kernels that launch child kernels, the kernel execution can also include a hollow part at the end. The hollow part represents the time after the kernel has finished executing where it is waiting for child kernels to finish executing. The CDP execution model requires that a parent kernel not complete until all child kernels complete and this is what the hollow part is showing.
You can see the exact parent/child relationships by expanding a parent kernel timeline. Notice the "+" icon next to the "nQue..." timeline row. If you open that you will see all the child kernels launched by any of the "nQue..." kernels in the timeline row. Those child kernels can in turn be expanded if they launched children of there own.
If you want to see the "family tree" of a particular kernel there are two ways to do it. You can select a kernel and all ancestors and descendants will highlight in the kernel rows and the compute rows. You can see this in your image. Ctrl-select works as well to multi-select. If you select a single kernel and right-click you can select "Focus" in the menu and this will hide all the kernels except ancestors/descendants of that kernel. Use "Don't Focus" from that same menu to restore all the kernels.
To your specific question about the long executing kernel. It appears to be one of the "nQue..." instances launched from the host. I can't say why it is executing so long without knowing more about the code. You could select it and use the Focus feature to see exactly what child kernels it is launching and perhaps that would give some insight.