profiling visualization coroutine greenlets

Are there common techniques for profiling coroutine-based code?

It's pretty obvious how to visualize a regular call stack and count inclusive and exclusive execution times. However, once coroutines are involved, the call stack can look pretty messy: a coroutine may yield execution not to its parent but to another coroutine (e.g. a greenlet). Are there common ways to produce consistent profiling output for such scenarios?

Solution

Think about a single sample of the stacks of all threads, taken at the same instant.

What you need to know is who's waiting for whom, and why. Normally, if function A is above B on a stack, it means A is waiting for B to return, and the reason is that A wanted B to do something. If you look at a whole stack for one thread, you get a chain of reasons why that particular nanosecond is being spent by that thread. If you're looking for speed, you're looking for chains of reasons that, altogether, you don't really need (because there is a weak link). This works even if the chain ends in I/O. If it is user input, it's simply waiting for the user. But if it's output, or disk I/O, or plain old CPU cranking, you might be able to do something to reduce it and get a performance gain (if you see the same problem on 2 or more samples).
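The "2 or more samples" rule can be sketched as a small tally. This is only an illustration: `frames_seen_repeatedly` and the sample data are hypothetical names invented here, and each sample is assumed to be a list of function names taken from one stack.

```python
from collections import Counter


def frames_seen_repeatedly(samples, min_count=2):
    """Return the frames that appear in at least `min_count` samples.

    `samples` is a list of stacks; each stack is a list of function names.
    """
    counts = Counter()
    for stack in samples:
        # Count a frame once per sample, even if it recurses within the stack.
        counts.update(set(stack))
    return {name: n for name, n in counts.items() if n >= min_count}


# Hypothetical samples, written outermost call first.
example_samples = [
    ["main", "handle_request", "render", "format_date"],
    ["main", "handle_request", "query_db"],
    ["main", "handle_request", "render", "format_date"],
]
repeated = frames_seen_repeatedly(example_samples)
```

Frames such as `main` show up in every sample and are rarely removable. The interesting entries are the deeper ones, like `format_date` here, that recur and that you can argue are not really needed.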

What if thread A is waiting for thread B? Then what you see at the bottom of A's stack is a function that waits for the other thread. You need to figure out which is thread B, and look at its stack, because the longer it takes, the longer A takes. So this is more difficult, but surely you're not afraid of that.
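In CPython, one way to get such a sample of every thread at once, without a debugger, is `sys._current_frames()`. The sketch below assumes plain OS threads. `sample_all_stacks` and `worker` are hypothetical names, and the underscore-prefixed function is CPython-specific. Note that `traceback.extract_stack` lists the outermost call first, so the waiting function described above is the last entry of each list.

```python
import sys
import threading
import time
import traceback


def sample_all_stacks():
    """Return {thread name: [function names, outermost first]} for one instant."""
    names = {t.ident: t.name for t in threading.enumerate()}
    snapshot = {}
    for thread_id, frame in sys._current_frames().items():
        stack = traceback.extract_stack(frame)  # outermost call first
        snapshot[names.get(thread_id, str(thread_id))] = [f.name for f in stack]
    return snapshot


def worker(stop_event):
    # Hypothetical workload: this thread just waits on another thread's signal.
    stop_event.wait()
```

Calling `sample_all_stacks()` while a thread is blocked in `worker` should show `worker` followed by the wait calls at the end of that thread's list, which tells you where to go looking for the thread it depends on.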

I'm talking about manual profiling here, where you take samples yourself, in a debugger, and apply your full attention to each sample. Profiling tools tend to assume you're lazy and only want numbers, and if nothing jumps out of those numbers you will be happy because you found nothing. In fact, if some silly needless activity is taking 30% of time, then on average the number of samples you require to see it twice is 2/0.3 = 6.67 samples (not a big number), and it is quite likely that you will see it and the profiler will not. That's random pausing.
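The arithmetic above can be checked with a few lines. This is a sketch that assumes each sample independently lands in the wasteful activity with probability equal to its share of the time; the function names are made up for illustration.

```python
def expected_samples(fraction, sightings=2):
    """Average number of samples needed to catch the activity `sightings` times."""
    return sightings / fraction


def prob_seen_at_least_twice(fraction, samples):
    """Chance that the activity shows up on 2 or more of `samples` samples."""
    miss = 1.0 - fraction
    p_zero = miss ** samples                              # never seen
    p_one = samples * fraction * miss ** (samples - 1)    # seen exactly once
    return 1.0 - p_zero - p_one


average = expected_samples(0.3)              # 2 / 0.3, about 6.67
chance = prob_seen_at_least_twice(0.3, 10)   # about 0.85
```

So with something taking 30% of the time, ten samples catch it twice or more roughly 85% of the time under these assumptions.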