linux-kernel perf flamegraph

How does linux's perf utility understand stack traces?


Linux's perf utility is famously used by Brendan Gregg to generate flame graphs for C/C++, JVM code, Node.js code, etc.

Does the Linux kernel natively understand stack traces? Where can I read more about how a tool is able to introspect into stack traces of processes, even if processes are written in completely different languages?


Solution

  • There is a short introduction to stack traces in perf by Gregg: http://www.brendangregg.com/perf.html

    4.4 Stack Traces

    Always compile with frame pointers. Omitting frame pointers is an evil compiler optimization that breaks debuggers, and sadly, is often the default. Without them, you may see incomplete stacks from perf_events ... There are two ways to fix this: either using dwarf data to unwind the stack, or returning the frame pointers.

    Dwarf

    Since about the 3.9 kernel, perf_events has supported a workaround for missing frame pointers in user-level stacks: libunwind, which uses dwarf. This can be enabled using "-g dwarf". ... compiler optimizations (-O2), which in this case has omitted the frame pointer. ... recompiling .. with -fno-omit-frame-pointer:

    Languages that are not C-style may use a different frame format, or may omit frame pointers too:

    4.3. JIT Symbols (Java, Node.js)

    Programs that have virtual machines (VMs), like Java's JVM and node's v8, execute their own virtual processor, which has its own way of executing functions and managing stacks. If you profile these using perf_events, you'll see symbols for the VM engine .. perf_events has JIT support to solve this, which requires the VM to maintain a /tmp/perf-PID.map file for symbol translation.

    Note that Java may not show full stacks to begin with, due to hotspot on x86 omitting the frame pointer (just like gcc). On newer versions (JDK 8u60+), you can use the -XX:+PreserveFramePointer option to fix this behavior, ...

    Gregg's blog post about Java and stack traces: http://techblog.netflix.com/2015/07/java-in-flames.html ("Fixing Frame Pointers": fixed in some JDK 8 versions and in JDK 9 by adding an option at program start)
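To make the /tmp/perf-PID.map mechanism quoted above more concrete, here is a minimal sketch of how a VM could emit map entries. It assumes the plain-text layout described in the kernel's perf JIT interface notes: one symbol per line, start address and size in hexadecimal without a 0x prefix, then the symbol name. The struct jit_symbol type and the perf_map_* function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One JIT-compiled function as the VM knows it (hypothetical struct). */
struct jit_symbol {
    uint64_t start;   /* address of the first instruction */
    uint64_t size;    /* size of the generated code in bytes */
    const char *name; /* name that perf should display */
};

/* Build the conventional map path for a process: /tmp/perf-<pid>.map */
int perf_map_path(char *buf, size_t len, long pid)
{
    return snprintf(buf, len, "/tmp/perf-%ld.map", pid);
}

/* Format one entry as "START SIZE name\n" (hex, no 0x prefix).
 * Returns the number of characters snprintf wanted to write. */
int perf_map_format(char *buf, size_t len, const struct jit_symbol *sym)
{
    assert(sym != NULL && sym->name != NULL);
    return snprintf(buf, len, "%" PRIx64 " %" PRIx64 " %s\n",
                    sym->start, sym->size, sym->name);
}

/* Append one entry to an already opened map file; returns 0 on success. */
int perf_map_append(FILE *map, const struct jit_symbol *sym)
{
    char line[512];
    int n = perf_map_format(line, sizeof line, sym);

    if (n < 0 || (size_t)n >= sizeof line)
        return -1;  /* formatting failed or the name was too long */
    return fputs(line, map) == EOF ? -1 : 0;
}
```

A VM would call perf_map_append each time it compiles a method, using a file opened at the path from perf_map_path(buf, sizeof buf, getpid()). perf then resolves sampled addresses inside JIT code by looking them up in that file.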

    Now, your questions:

    How does linux's perf utility understand stack traces?

    The perf utility basically (in early versions) just parses data returned by the Linux kernel's "perf_events" subsystem (sometimes just called "events"), which is accessed through the perf_event_open syscall. For call stack traces there are the sample_type options PERF_SAMPLE_CALLCHAIN and PERF_SAMPLE_STACK_USER:

    sample_type
              PERF_SAMPLE_CALLCHAIN
                     Records the callchain (stack backtrace).

              PERF_SAMPLE_STACK_USER (since Linux 3.7)
                     Records the user level stack, allowing stack unwinding.
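As a rough illustration of how a profiler asks for call chains, the sketch below fills in a perf_event_attr with PERF_SAMPLE_CALLCHAIN and wraps the raw syscall. The helper names callchain_attr_init and perf_event_open_raw are made up for this example, and the choice of CPU cycles as the event and of exclude_kernel is an assumption made for the demo, not something perf itself requires.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Describe a CPU-cycle sampling event that records a call chain with
 * every sample. The event choice is an assumption for this demo. */
void callchain_attr_init(struct perf_event_attr *attr,
                         unsigned long long period)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof *attr);
    attr->size = sizeof *attr;
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->sample_period = period;        /* one sample every `period` cycles */
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                        PERF_SAMPLE_CALLCHAIN;
    attr->exclude_kernel = 1;            /* do not count while in kernel mode */
    attr->disabled = 1;                  /* start stopped; enable via ioctl */
}

/* glibc ships no wrapper for perf_event_open, so call it via syscall(2). */
long perf_event_open_raw(struct perf_event_attr *attr, pid_t pid, int cpu,
                         int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```

A caller would typically use perf_event_open_raw(&attr, 0, -1, -1, 0) to profile the current process on any CPU, mmap the returned file descriptor, and read PERF_RECORD_SAMPLE records from the ring buffer. With PERF_SAMPLE_CALLCHAIN set, each record carries a count followed by that many instruction pointers. Whether the call succeeds depends on the system's perf_event_paranoid setting.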
    

    Does the Linux kernel natively understand stack traces?

    It may or may not, depending on your CPU architecture and on whether support has been implemented for it. The callchain sampling functions (which read the call stack of a live process) are defined in the architecture-independent part of the kernel as __weak with empty bodies:

    http://lxr.free-electrons.com/source/kernel/events/callchain.c?v=4.4#L26

    __weak void perf_callchain_kernel(struct perf_callchain_entry *entry,
                                      struct pt_regs *regs)
    {
    }

    __weak void perf_callchain_user(struct perf_callchain_entry *entry,
                                    struct pt_regs *regs)
    {
    }
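The __weak pattern can be tried outside the kernel with GCC's or Clang's weak attribute, which is what the kernel macro expands to. The sketch below is a simplified user-space stand-in: struct callchain, arch_callchain_user and sample_user_callchain are hypothetical names and not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 8

/* Hypothetical, simplified stand-in for the kernel's callchain entry. */
struct callchain {
    size_t nr;                    /* number of valid entries in ip[] */
    unsigned long ip[MAX_DEPTH];  /* recorded return addresses */
};

/* Generic default: weak, with an empty body. A non-weak definition of the
 * same function in another object file replaces it at link time. */
__attribute__((weak)) void arch_callchain_user(struct callchain *entry)
{
    (void)entry;  /* no architecture support: record nothing */
}

/* Generic code calls the hook without knowing which definition was linked. */
size_t sample_user_callchain(struct callchain *entry)
{
    assert(entry != NULL);
    entry->nr = 0;
    arch_callchain_user(entry);
    return entry->nr;
}
```

Built on its own, sample_user_callchain returns 0 because only the empty default exists. Linking in a second file that defines a regular (strong) arch_callchain_user makes the same call produce frames, which mirrors how an architecture "opts in" to callchain support.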
    

    In the 4.4 kernel, the user-space callchain sampler is redefined in the architecture-dependent part of the kernel for x86/x86_64, ARC, SPARC, ARM/ARM64, Xtensa, Tilera TILE, PowerPC, and Imagination Meta:

    http://lxr.free-electrons.com/ident?v=4.4;i=perf_callchain_user

    arch/x86/kernel/cpu/perf_event.c, line 2279
    arch/arc/kernel/perf_event.c, line 72
    arch/sparc/kernel/perf_event.c, line 1829
    arch/arm/kernel/perf_callchain.c, line 62
    arch/xtensa/kernel/perf_event.c, line 339
    arch/tile/kernel/perf_event.c, line 995
    arch/arm64/kernel/perf_callchain.c, line 109
    arch/powerpc/perf/callchain.c, line 490
    arch/metag/kernel/perf_callchain.c, line 59
    

    Reading the call chain from the user stack may not be trivial for some architectures and/or for some modes.
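For the simple case, x86-64 code built with frame pointers, the walk amounts to following a linked list: each frame pointer addresses a pair made of the caller's saved frame pointer and the return address. The sketch below shows that idea on a copied stack region rather than on live user memory, so it illustrates the technique and is not the kernel's code. The struct frame layout, the walk_frames name and the stopping rules are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout at each frame pointer (x86-64, frame pointers kept):
 * [fp] = caller's saved frame pointer, [fp + 8] = return address. */
struct frame {
    uint64_t next_fp;
    uint64_t ret_addr;
};

/* Walk a frame-pointer chain inside a copied stack region.
 * stack/stack_len: bytes copied from the target process;
 * base: the address that stack[0] had in the target;
 * fp: the frame pointer register at sample time.
 * Returns the number of return addresses written to out[]. */
size_t walk_frames(const uint8_t *stack, size_t stack_len, uint64_t base,
                   uint64_t fp, uint64_t *out, size_t max)
{
    size_t n = 0;

    assert(stack != NULL && out != NULL);
    while (n < max) {
        struct frame f;

        /* Stop on any frame pointer that leaves the copied region. */
        if (fp < base || fp - base > stack_len ||
            stack_len - (fp - base) < sizeof f)
            break;
        if (fp % sizeof(uint64_t) != 0)
            break;  /* misaligned: not a plausible frame pointer */

        memcpy(&f, stack + (fp - base), sizeof f);
        if (f.ret_addr == 0)
            break;
        out[n++] = f.ret_addr;

        /* The stack grows down, so a caller's frame sits at a higher
         * address. Anything else means the chain ended or is corrupt. */
        if (f.next_fp <= fp)
            break;
        fp = f.next_fp;
    }
    return n;
}
```

The bounds and ordering checks matter because a sampler cannot trust the target: code compiled without frame pointers uses that register for ordinary data, which is exactly why such stacks come out truncated or wrong.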

    What CPU architecture do you use? What languages and VMs are used?

    Where can I read more about how a tool is able to introspect into stack traces of processes, even if processes are written in completely different languages?

    You may try gdb and/or debuggers for the language, the backtrace function of libc, or the read-only unwinding support in libunwind (there is a local backtrace example in libunwind, show_backtrace()).

    They may have better support for frame parsing, or better integration with the language's virtual machine or with unwind info. If gdb (with the backtrace command) or other debuggers can't get stack traces from a running program, there may be no way of getting a stack trace at all.

    If they can get a call trace but perf can't (even after recompiling with -fno-omit-frame-pointer for C/C++), it may be possible to add support for that combination of architecture and frame format to perf_events and perf.
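Of the options listed above, libc's backtrace function is probably the quickest to experiment with. The following is a minimal sketch that assumes glibc (other C libraries may not provide execinfo.h). The capture_stack and print_stack helpers are hypothetical names wrapping the real backtrace and backtrace_symbols calls.

```c
#include <assert.h>
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_FRAMES 64

/* Capture the current call stack; returns the number of frames stored. */
int capture_stack(void **frames, int max)
{
    assert(frames != NULL && max > 0);
    return backtrace(frames, max);
}

/* Print the current stack to `out`, one frame per line.
 * Returns the number of frames printed, or -1 on allocation failure. */
int print_stack(FILE *out)
{
    void *frames[MAX_FRAMES];
    int n = capture_stack(frames, MAX_FRAMES);
    char **names = backtrace_symbols(frames, n);  /* malloc'ed array */

    if (names == NULL)
        return -1;
    for (int i = 0; i < n; i++)
        fprintf(out, "%s\n", names[i]);
    free(names);  /* only the array itself must be freed */
    return n;
}
```

Calling print_stack(stderr) from a nested function shows how deep the unwinder can see. With glibc, function names usually appear only if the program is linked with -rdynamic; otherwise the lines contain raw addresses.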

    There are several blogs with some info about generic backtracing problems and solutions:

    Dwarf support for perf_events/perf: