Using a loop in a CUDA graph

I have kernels A, B, and C, which need to be executed sequentially.

A->B->C

They are executed in a while loop until some condition is met.

while(predicate) {
    A->B->C
}

The while loop may run anywhere from 3 to 2000 times. The information that the loop should stop is produced by kernel C.

Since the execution consists of many invocations of relatively small kernels, a CUDA graph sounds like a good idea. However, the CUDA graph examples I have seen are all linear or tree-like, without loops.

If a loop is not possible, a long chain of up to 2000 A->B->C repetitions with the possibility of an early stop triggered by kernel C would also be OK. However, is it possible to stop the graph execution at some position by a call from inside a kernel?

Solution

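CUDA graphs, as originally designed, have no conditionals. A vertex of the graph is visited/executed when its predecessors are complete, and that's that. So, fundamentally, you cannot express a data-dependent loop or an early exit within such a graph. (Recent CUDA 12 releases added conditional graph nodes, including a while-type node; the workarounds below apply where those are unavailable.)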

What can you do?

Have a smaller graph for a single loop iteration, and launch it repeatedly from the host, checking the predicate between launches.
Have A, B and C start their execution by checking the loop predicate - and skip all work if it no longer holds. With that being the case, you can schedule many instances of A->B->C->A->B->C etc - which, starting at some point, will do nothing.
Don't rely on the CUDA graphs API. It's not a general-purpose parallel execution mechanism. :-(
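
The first two options can be combined. Below is a minimal sketch, not a definitive implementation: the kernel bodies, the LoopState struct, the runLoop function and its parameters (limit, maxIterations, chunk) are all hypothetical placeholders, not taken from the question. It captures a chain of `chunk` guarded A->B->C rounds into one graph, relaunches that graph from the host, and reads back a stop flag written by kernel C. It assumes CUDA 11.4 or newer for cudaGraphInstantiateWithFlags, and omits error checking for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical loop state kept in device memory.
struct LoopState {
    int iterations;  // number of A->B->C rounds that did real work
    int done;        // set to 1 by kernel C once the loop should stop
};

// Placeholder kernels: each one returns immediately once `done` is set.
__global__ void kernelA(const LoopState* s, float* data) {
    if (s->done) return;
    data[threadIdx.x] += 1.0f;
}

__global__ void kernelB(const LoopState* s, float* data) {
    if (s->done) return;
    data[threadIdx.x] *= 1.5f;
}

// Launched with a single thread, so only one thread touches the state.
__global__ void kernelC(LoopState* s, const float* data, float limit) {
    if (s->done) return;
    s->iterations += 1;
    if (data[0] >= limit) s->done = 1;  // placeholder stop condition
}

// Runs A->B->C until kernel C signals completion, or until at least
// maxIterations rounds have been attempted. Returns the number of rounds
// that did real work.
int runLoop(float limit, int maxIterations, int chunk) {
    const int n = 32;
    LoopState* dState = nullptr;
    float* dData = nullptr;
    cudaMalloc(&dState, sizeof(LoopState));
    cudaMalloc(&dData, n * sizeof(float));
    cudaMemset(dState, 0, sizeof(LoopState));
    cudaMemset(dData, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture `chunk` unrolled rounds into a single graph.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < chunk; ++i) {
        kernelA<<<1, n, 0, stream>>>(dState, dData);
        kernelB<<<1, n, 0, stream>>>(dState, dData);
        kernelC<<<1, 1, 0, stream>>>(dState, dData, limit);
    }
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);

    // Relaunch the graph; check the flag on the host between launches.
    LoopState h = {0, 0};
    while (!h.done && h.iterations < maxIterations) {
        cudaGraphLaunch(exec, stream);
        cudaMemcpyAsync(&h, dState, sizeof(h), cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
    }

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(dData);
    cudaFree(dState);
    return h.iterations;
}
```

A call such as runLoop(10.0f, 2000, 8) launches the graph in chunks of 8 rounds. The chunk size is a trade-off: a larger chunk means fewer host round trips, but up to chunk - 1 rounds of no-op kernel launches after kernel C sets the flag. If chunk does not divide maxIterations, the last launch may run a few rounds past the limit.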