I have kernels A, B, and C, which need to be executed sequentially:
A->B->C
They are executed in a while loop until some condition is met:
while(predicate) {
A->B->C
}
The while loop may run anywhere from 3 to 2000 times; the information that the loop should stop is produced by kernel C.
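To make the setup concrete, here is a minimal sketch of such a loop driven from the host without graphs. The kernel bodies, the `limit` parameter, and the `d_stop` flag are placeholders I made up for illustration; the real A, B, and C do something else, and only the control flow matters here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernels: the real A, B and C are not shown in the question.
__global__ void A(int *data) { data[0] += 1; }
__global__ void B(int *data) { data[0] += 1; }

// Hypothetical stop rule: C raises the flag once data[0] reaches `limit`.
__global__ void C(const int *data, int *stop, int limit) {
    if (data[0] >= limit) *stop = 1;
}

int main() {
    int *d_data = nullptr, *d_stop = nullptr;
    cudaMalloc(&d_data, sizeof(int));
    cudaMalloc(&d_stop, sizeof(int));
    cudaMemset(d_data, 0, sizeof(int));
    cudaMemset(d_stop, 0, sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int h_stop = 0;
    int iterations = 0;
    const int max_iterations = 2000;  // upper bound mentioned above

    while (!h_stop && iterations < max_iterations) {
        // Same-stream launches run in order: A, then B, then C.
        A<<<1, 1, 0, stream>>>(d_data);
        B<<<1, 1, 0, stream>>>(d_data);
        C<<<1, 1, 0, stream>>>(d_data, d_stop, 10);

        // The host needs C's verdict before it can decide on another round.
        cudaMemcpyAsync(&h_stop, d_stop, sizeof(int),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        ++iterations;
    }

    printf("stopped after %d iterations\n", iterations);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFree(d_stop);
    return 0;
}
```

The per-iteration cost here is three kernel launches plus a copy and a synchronization, and that overhead is what I am hoping to reduce.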
Since the execution consists of many invocations of relatively small kernels, CUDA Graphs sound like a good idea. However, the CUDA graph implementations I have seen are all linear or tree-like, without loops.
If a loop is not possible, a long chain of 2000 kernels with the possibility of an early stop triggered from kernel C would also be OK. But is it possible to stop graph execution at some point by a call from inside a kernel?
CUDA graphs have no conditionals. A vertex of the graph is visited/executed when its predecessors are complete, and that's that. So, fundamentally, you cannot do this with a CUDA graph.
What can you do?