Tags: python, linux-kernel, ebpf, bcc-bpf

eBPF kprobe argument not matching the syscall


I'm learning eBPF and experimenting with it while following the docs, but there's one thing I can't get to work and I don't understand why.

I have this very simple C program that exits with status 5.

#include <stdlib.h>  /* declares exit() */

int main() {
   exit(5);
   return 0;  /* never reached */
}

The exit function in the code above calls the exit_group syscall, as we can see with strace (image below). Yet in my Python code, which uses eBPF through bcc, bpf_trace_printk prints the value 208682672 instead of 5, the value exit_group is called with.

[image: strace output]

from bcc import BPF

def main():
    bpftext = """
    #include <uapi/linux/ptrace.h>

    void my_exit(struct pt_regs *ctx, int status){
        bpf_trace_printk("%d", status);
    }
    """

    bpf = BPF(text=bpftext)
    fname = bpf.get_syscall_fnname('exit_group')
    bpf.attach_kprobe(event=fname, fn_name='my_exit')

    while True:
        print(bpf.trace_fields())


if __name__ == '__main__':
    main()

I've been investigating this problem for a few days now and have looked through everything I could find online, but I couldn't find a solution.

I truly appreciate any help available and thank you!


Solution

  • Fix

    You need to rename your function from my_exit to syscall__exit_group.

    Why does this matter? BPF programs named in this way get special handling from BCC. Here's what the documentation says:

    8. system call tracepoints

    Syntax: syscall__SYSCALLNAME

    syscall__ is a special prefix that creates a kprobe for the system call name provided as the remainder. You can use it by declaring a normal C function, then using the Python BPF.get_syscall_fnname(SYSCALLNAME) and BPF.attach_kprobe() to associate it.

    Arguments are specified on the function declaration: syscall__SYSCALLNAME(struct pt_regs *ctx, [, argument1 ...]).

    For example:

    int syscall__execve(struct pt_regs *ctx,
        const char __user *filename,
        const char __user *const __user *__argv,
        const char __user *const __user *__envp)
    {
        [...]
    }
    

    This instruments the execve system call.

    Source.

    Corrected Code

    from bcc import BPF
    
    def main():
        bpftext = """
        #include <uapi/linux/ptrace.h>
    
        void syscall__exit_group(struct pt_regs *ctx, int status){
            bpf_trace_printk("%d", status);
        }
        """
    
        bpf = BPF(text=bpftext)
        fname = bpf.get_syscall_fnname('exit_group')
        bpf.attach_kprobe(event=fname, fn_name='syscall__exit_group')
    
        while True:
            print(bpf.trace_fields())
    
    
    if __name__ == '__main__':
        main()
    

    Output from the sample program exiting:

    (b'<...>', 14896, 0, b'd...1', 3996.079261, b'5')
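
The tuple above can be unpacked to get the exit status as an integer. This is a minimal sketch that assumes the field order documented for BPF.trace_fields(), namely (task, pid, cpu, flags, timestamp, message). The helper name parse_trace_fields is hypothetical, and the sample tuple is copied from the output above so the snippet runs without bcc or root.

```python
def parse_trace_fields(fields):
    """Unpack one tuple shaped like the return value of BPF.trace_fields()."""
    task, pid, cpu, flags, ts, msg = fields
    return {
        "task": task.decode(errors="replace"),
        "pid": pid,
        # msg is the bytes string produced by bpf_trace_printk("%d", status)
        "status": int(msg.decode()),
    }


# Sample tuple copied from the output shown above.
sample = (b'<...>', 14896, 0, b'd...1', 3996.079261, b'5')
event = parse_trace_fields(sample)
print(f"pid {event['pid']} exited with status {event['status']}")
# prints: pid 14896 exited with status 5
```

In the tracing loop, the same helper could be applied to each value returned by bpf.trace_fields() instead of printing the raw tuple.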
    

    How it Works

    BCC rewrites your BPF program before compiling it, and the syscall__ prefix changes how the function's arguments are interpreted. To see the transformed code, import bcc and construct the program with bpf = BPF(text=bpftext, debug=bcc.DEBUG_PREPROCESSOR).

    Here's what happens without the syscall__ prefix:

    void my_exit(struct pt_regs *ctx){
     int status = ctx->di;
            ({ char _fmt[] = "%d"; bpf_trace_printk_(_fmt, sizeof(_fmt), status); });
        }
    

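    This reads the RDI register and interprets it directly as the syscall argument. On a kernel with syscall wrappers, RDI instead holds a pointer to another struct pt_regs, so the number printed (208682672 in the question) is that pointer truncated to an int rather than the exit status.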

    On the other hand, here's what happens if it's named syscall__exit_group:

    void syscall__exit_group(struct pt_regs *ctx){
    #if defined(CONFIG_ARCH_HAS_SYSCALL_WRAPPER) && !defined(__s390x__)
     struct pt_regs * __ctx = ctx->di;
     int status; bpf_probe_read(&status, sizeof(status), &__ctx->di);
    #else
     int status = ctx->di;
    #endif
    
            ({ char _fmt[] = "%d"; bpf_trace_printk_(_fmt, sizeof(_fmt), status); });
        }
    

    If CONFIG_ARCH_HAS_SYSCALL_WRAPPER is defined (it is on x86_64), the RDI register is interpreted as a pointer to a second struct pt_regs, and the RDI field of that inner structure is read with bpf_probe_read(). That inner RDI holds the first argument to exit_group().

    On systems without syscall wrappers, this does the same thing as the previous example.
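
The difference between the two generated functions can be modelled in user space. The sketch below is only an analogy, not BPF code: PtRegs is a hypothetical one-field stand-in for the kernel's struct pt_regs, and ctypes memory plays the role of kernel memory. It shows why reading di directly yields an arbitrary-looking number while following the pointer yields 5.

```python
import ctypes


class PtRegs(ctypes.Structure):
    # Hypothetical stand-in for struct pt_regs; only the di field is modelled.
    _fields_ = [("di", ctypes.c_uint64)]


# Inner registers: the values user space passed to exit_group(5).
inner = PtRegs(di=5)
# Outer registers: what the kprobe receives when syscall wrappers are enabled.
# Its di holds the address of the inner structure, not the argument itself.
outer = PtRegs(di=ctypes.addressof(inner))

# Without the prefix: `int status = ctx->di;` truncates the pointer to 32 bits.
naive_status = ctypes.c_int32(outer.di & 0xFFFFFFFF).value

# With the prefix: follow the pointer first, then read the inner di.
real_status = ctypes.c_int32(PtRegs.from_address(outer.di).di & 0xFFFFFFFF).value

print("naive read:", naive_status)  # address-dependent, changes between runs
print("real read: ", real_status)
```

The naive value depends on where the inner structure happens to live in memory, which mirrors the seemingly random 208682672 from the question.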