Tags: linux, multithreading, performance, profiling, perf

Thread utilization profiling on Linux


Linux perf-tools are great for finding hotspots in CPU cycles and optimizing those hotspots. But once some parts are parallelized it becomes difficult to spot the sequential parts since they take up significant wall time but not necessarily many CPU cycles (the parallel parts are already burning those).

To avoid the XY problem: my underlying motivation is to find sequential bottlenecks in multi-threaded code. The parallel phases can easily dominate the aggregate CPU-cycle statistics even though the sequential phases dominate wall time due to Amdahl's law.
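To make the mismatch concrete, here is a minimal sketch with made-up numbers. The function name `phase_shares` and the workload figures are purely illustrative, and the parallel phase is assumed to scale perfectly across threads:

```python
def phase_shares(sequential_s, parallel_cpu_s, threads):
    """Return the sequential phase's share of (wall time, CPU time).

    Assumes the parallel phase scales perfectly across `threads`,
    which real workloads rarely do.
    """
    parallel_wall_s = parallel_cpu_s / threads
    wall_total = sequential_s + parallel_wall_s
    cpu_total = sequential_s + parallel_cpu_s
    return sequential_s / wall_total, sequential_s / cpu_total


# Hypothetical workload: 1 s of sequential work, 8 s of CPU work spread over 8 threads.
wall_share, cpu_share = phase_shares(sequential_s=1.0, parallel_cpu_s=8.0, threads=8)
print(f"sequential phase: {wall_share:.0%} of wall time, {cpu_share:.0%} of CPU time")
# prints "sequential phase: 50% of wall time, 11% of CPU time"
```

In this made-up case the sequential phase accounts for half of the elapsed time but only about one in nine CPU samples, which is why a cycle-based profile tends to bury it.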

For Java applications this is fairly easy to achieve with VisualVM or YourKit, which have thread-utilization timelines.

[Screenshot: YourKit thread timeline]

Note that it shows both thread state (runnable, waiting, blocked) and stack samples for selected ranges or points in time.

How do I achieve something comparable with perf or other native profilers on Linux? It doesn't have to be a GUI visualization, just a way to find sequential bottlenecks and the CPU samples associated with them.

See also a narrower follow-up question focusing on perf.


Solution

  • Oracle's Developer Studio Performance Analyzer might do exactly what you're looking for. (If you were running on Solaris, I know it would do exactly what you're looking for, but I've never used it on Linux, and I don't have access right now to a Linux system suitable to try it on.)

    This is a screenshot of a multithreaded IO test program, running on an x86 Solaris 11 system:

    [Screenshot of multithreaded IO performance test program]

    Note that you can see the call stack of every thread and exactly how the threads interact. In the posted example, you can see where the threads that actually perform the IO start, and you can follow each thread as it runs.

    This is a view that shows exactly where thread 2 is at the highlighted moment:

    [Screenshot: thread 2 at the highlighted moment]

    This view has the synchronization event view enabled, showing that thread 2 is stuck in a sem_wait() call for the highlighted period. Note the additional rows of graphical data, showing the synchronization events (sem_wait(), pthread_cond_wait(), pthread_mutex_lock(), etc.):

    [Screenshot: synchronization event view]

    Other views include a call tree:

    [Screenshot: call tree]

    a thread overview (not very useful with only a handful of threads, but likely very useful if you have hundreds or more):

    [Screenshot: thread overview]

    and a view showing per-function CPU utilization:

    [Screenshot: function CPU utilization]

    And you can see how much time is spent on each line of code:

    [Screenshot: time spent per line of code]

    Unsurprisingly, a process that's writing a large file to test IO performance spent almost all its time in the write() function.

    The full Oracle brief is at https://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/o11-151-perf-analyzer-brief-1405338.pdf

    Quick usage overview: