pydrake: How do I identify slow Python LeafSystems (to possibly rewrite in C++)?

I am prototyping a simple Drake simulation. I have some simple Python LeafSystems that implement controllers, and find that without these systems, my simulation can run at realtime; however, with these systems, my simulation runs much slower than realtime.

I don't think the math itself is the bottleneck; I suspect it's just the overhead of Python vs. C++.

For this code:
https://github.com/EricCousineau-TRI/repro/tree/2e3865a7aefe8adc19a6ff69e84025def03da7fd/drake_stuff/python_profiling

If I try to use Python's cProfile and then use snakeviz to visualize the results, I can see that my Python code seems slow, but I can't tell how it compares to the C++ Drake code that pydrake is binding.
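For reference, a minimal sketch of the cProfile workflow described above. `fake_simulation` is a hypothetical stand-in for the real simulation entry point, and the output path is arbitrary. As far as I understand, cProfile only records calls into extension code as opaque built-in entries, which is why the C++ side is not broken down.

```python
import cProfile
import pstats
import tempfile
from pathlib import Path


def profile_to_file(func, stats_path):
    """Run func() under cProfile and write the raw stats to stats_path."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func()
    finally:
        profiler.disable()
    profiler.dump_stats(str(stats_path))
    return result


def fake_simulation():
    # Hypothetical stand-in for the real simulation entry point.
    return sum(i * i for i in range(10_000))


stats_path = Path(tempfile.gettempdir()) / "sim_profile.prof"
result = profile_to_file(fake_simulation, stats_path)

# Print the five entries with the largest cumulative time.
pstats.Stats(str(stats_path)).sort_stats("cumulative").print_stats(5)
```

The resulting `.prof` file can then be opened with `snakeviz <path to .prof>`.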

Without Python LeafSystems (--no_control):

With the Python LeafSystem:

My tracepoint is in main(), but it does not appear in either of those.

How do I get better information about relative timing, without rolling my own timers?

Solution

I'm not sure if this is the best answer, but I found this post: https://stackoverflow.com/a/61253170/7829525

py-spy seems like an excellent tool for seeing relative performance information for Python code that involves CPython API extensions.

From my naive usage mentioned here:
https://github.com/benfred/py-spy/issues/531
https://github.com/EricCousineau-TRI/repro/tree/6048da3/drake_stuff/python_profiling

I can now see more information.
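A rough sketch of the kind of invocation involved, written as a small Python helper that only assembles the py-spy command line. The script name `sim.py` and the output file names are hypothetical placeholders; `record`, `--output`, `--rate`, and `--native` are py-spy options, and `--no_control` is the flag from the linked repro.

```python
import shlex
import sys


def build_py_spy_command(script_args, output="profile.svg", rate=100, native=False):
    """Assemble a `py-spy record` command line as a list of arguments."""
    cmd = ["py-spy", "record", "--output", output, "--rate", str(rate)]
    if native:
        # Also sample native (C/C++) frames from extension modules.
        cmd.append("--native")
    # Everything after "--" is the program being profiled.
    cmd += ["--", sys.executable, *script_args]
    return cmd


# "sim.py" is a hypothetical script name used only for illustration.
plain = build_py_spy_command(["sim.py", "--no_control"], output="no_control.svg")
native = build_py_spy_command(["sim.py"], output="control_native.svg", native=True)

print(shlex.join(plain))
print(shlex.join(native))
```

To actually record a flamegraph, pass one of these lists to `subprocess.run(cmd, check=True)` with py-spy installed; on some systems py-spy may need elevated permissions.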

Looking at interactive SVG flamegraphs from py-spy with default rate of 100 samples/sec:

Without Python LeafSystems (--no_control):

With Python LeafSystems:

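Per @nicho's suggestion below, running py-spy with the --native flag can provide much better detail, since it also samples native (C++) frames. Searching the flamegraph (Ctrl+F) for .py: highlights the Python frames; here's what it looks like: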

Without Python LeafSystems (--no_control):

With Python LeafSystems: