Search code examples
linuxlinux-kernelftrace

How can I use ftrace filtering to see the call stack above a function?


It's well-documented how to use ftrace to find the function graph starting from a certain function, e.g.

# echo nop > current_tracer
# echo 100 > max_graph_depth
# echo ksys_dup3 > set_graph_function
# echo function_graph > current_tracer
# cat trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
  7)               |  ksys_dup3() {
  7)   0.533 us    |    expand_files();
  7)               |    do_dup2() {
  7)               |      filp_close() {
  7)   0.405 us    |        dnotify_flush();
  7)   0.459 us    |        locks_remove_posix();
  7)               |        fput() {
  7)               |          fput_many() {
  7)               |            task_work_add() {
  7)   0.533 us    |              kick_process();
  7)   1.558 us    |            }
  7)   2.475 us    |          }
  7)   3.382 us    |        }
  7)   6.122 us    |      }
  7)   7.104 us    |    }
  7) + 10.763 us   |  }

But this only returns the function graph starting at ksys_dup3. It omits the full function graph that leads to ksys_dup3:

  7)               |  el0_svc_handler() {
  7)               |    el0_svc_common() {
  7)               |      __arm64_sys_dup3() {
  7)               |        ksys_dup3() {
  7)   0.416 us    |          expand_files();
  7)               |          do_dup2() {
  7)               |            filp_close() {
  7)   0.405 us    |              dnotify_flush();
  7)   0.406 us    |              locks_remove_posix();
  7)               |              fput() {
  7)   0.416 us    |                fput_many();
  7)   1.269 us    |              }
  7)   3.819 us    |            }
  7)   4.746 us    |          }
  7)   6.475 us    |        }
  7)   7.381 us    |      }
  7)   8.362 us    |    }
  7)   9.205 us    |  }

Is there a way to use ftrace to filter a full function graph?


Solution

  • I'd say all of ftrace is well documented (here). There is no way to do what you want though, because the filtering ability of ftrace is implemented through triggers that can start/stop tracing exactly when an event happens. Setting set_graph_function will trigger a trace start when the function is entered and a trace stop when it's exited. Since when you enter el0_svc_handler you cannot know beforehand if ksys_dup3 is going to be called, there is no way to write a trigger to start the trace on such a condition (that would require being able to "predict the future").

    You can however do fine grained filtering with the set_ftrace_pid parameter writing a program that only does what you need, running a full trace on that program and filtering on the parent el0_svc_handler.


    In case you don't actually know which parent functions are being called before the one you want, you can do a single targeted run with the func_stack_trace option enabled to get an idea about what the entire call chain is, and then use that output to set the appropriate filter for a "normal" run.

    For example, let's say I want to trace do_dup2, but as you say I want to start from one of the parent functions. I'll first write a dummy test program like the following one:

    int main(void) {
        printf("My PID is %d, press ENTER to go...\n", getpid());
        getchar();
        dup2(0, 1);
        return 0;
    }
    

    I'll compile and start the above program in one shell:

    $ ./test
    My PID is 1234, press ENTER to go...
    

    Then in another root shell, configure tracing as follows:

    cd /sys/kernel/tracing
    echo function > current_tracer
    echo 1 > options/func_stack_trace
    echo do_dup2 > set_ftrace_filter
    echo PID_OF_THE_PROGRAM_HERE > set_ftrace_pid
    

    Note: echo do_dup2 > set_ftrace_filter is very important otherwise you will trace and dump the stack for every single kernel function, which would be a huge performance hit and can make the system unresponsive. For the same reason doing echo PID_OF_THE_PROGRAM_HERE > set_ftrace_pid is also important if you don't want to trace every single dup2 syscall done by the system.

    Now I can do one trace:

    echo 1 > tracing_on
    # ... press ENTER in the other shell ...
    echo 0 > tracing_on
    cat trace
    

    And the result will be something like this (my machine is x86 so the syscall entry functions have different names):

    # tracer: function
    #
    # entries-in-buffer/entries-written: 2/2   #P:20
    #
    #                                _-----=> irqs-off
    #                               / _----=> need-resched
    #                              | / _---=> hardirq/softirq
    #                              || / _--=> preempt-depth
    #                              ||| /     delay
    #           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
    #              | |         |   ||||      |         |
               a.out-6396    [000] ....  1455.370160: do_dup2 <-__x64_sys_dup2
               a.out-6396    [000] ....  1455.370174: <stack trace>
     => 0xffffffffc322006a
     => do_dup2
     => __x64_sys_dup2
     => do_syscall_64
     => entry_SYSCALL_64_after_hwframe
    

    I can now see the entire call chain that led to do_dup2 (printed in reverse order above) and then do another normal trace from one of the parent functions (remember to disable options/func_stack_trace first).