Tags: linux-kernel, system-calls, tracepoint

In a Linux system call, are system call parameters preserved in registers after the syscall finished (at the sys_exit tracepoint)?


Is it guaranteed to be able to read all the syscall parameters at sys_exit tracepoint?

The sysdig driver is a kernel module that captures syscalls using the kernel's static tracepoints. In this project, some syscall parameters are read at the sys_enter tracepoint, and others are read at sys_exit (the return value, of course, and the contents of userspace memory, to avoid page faults).

Why not read all parameters at sys_exit? Is this because some parameters may not be available at sys_exit?


Solution

  • Is it guaranteed to be able to read all the syscall parameters at sys_exit tracepoint?

    Yes... and no: we need to distinguish parameters from registers. Linux syscalls should preserve all general purpose userspace registers, except the register used for the return value (and, on some architectures, a second register that indicates whether an error occurred). However, this does not mean that the input parameters of the syscall cannot change between entry and exit: if a register holds a pointer to some data, the register itself does not change, but the data it points to could very well change.
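    To make the register/parameter distinction concrete, here is a minimal user-space sketch. The helper name `pointee_changes_across_read` is made up for illustration, and the address of `buf` merely stands in for the register value a tracer would see; it is not how sysdig inspects arguments.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from sysdig or the kernel).
 * Returns 1 if the buffer contents changed across read(2) while the
 * pointer argument kept the same value, 0 if they did not, -1 on error.
 */
int pointee_changes_across_read(void)
{
    int fds[2];
    char buf[8];
    char snapshot[8];

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], "hello", 5) != 5) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* State "at syscall entry": known buffer contents and the pointer value. */
    memset(buf, 'x', sizeof(buf));
    memcpy(snapshot, buf, sizeof(buf));
    uintptr_t arg_at_entry = (uintptr_t)buf;

    ssize_t n = read(fds[0], buf, sizeof(buf));

    /* State "at syscall exit": same pointer value, different pointee. */
    uintptr_t arg_at_exit = (uintptr_t)buf;
    close(fds[0]);
    close(fds[1]);
    if (n != 5)
        return -1;

    assert(arg_at_entry == arg_at_exit);
    return memcmp(snapshot, buf, sizeof(buf)) != 0;
}
```

    Calling the helper from a `main` should report that the pointee changed even though the argument value did not, which is exactly the situation a tracer reading only at exit cannot detect.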

    Looking at the code for the static tracepoint sys_exit, you can see that only the syscall number (id) and its return value (ret) are traced. See note at the bottom of my answer for more.
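    As a rough picture of what each event records, the two payloads can be mirrored in user space as below. The struct names are hypothetical and the common trace header that precedes the payload in a real record is omitted. The fields reflect my reading of the kernel's event definitions (sys_enter carries the id plus six argument slots, sys_exit carries only the id and the return value), so treat the layout as an assumption and check the event's `format` file on your kernel.

```c
/*
 * Hypothetical user-space mirrors of the two event payloads.
 * Struct names are made up for illustration; the common trace header
 * that comes before these fields in a real record is left out.
 */
struct sys_enter_payload {
    long id;                /* syscall number */
    unsigned long args[6];  /* raw argument values captured at entry */
};

struct sys_exit_payload {
    long id;                /* syscall number */
    long ret;               /* return value; no argument slots here */
};
```

    The asymmetry is the point: anything not captured at entry has to be recovered some other way at exit.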

  • Why not read all parameters at sys_exit? Is this because some parameters may not be available at sys_exit?

    Yes, I would say that ensuring the correctness of the traced parameters is the main reason why tracing only at the exit would be a bad idea. Even if you get the values of the registers, you cannot know the real parameters at syscall exit. Even though a syscall per se is guaranteed to save and restore the state of user registers, the syscall itself can alter the data that is passed as an argument. For example, the recvmsg syscall takes a pointer to a struct msghdr in memory which is used both as an input and an output parameter; the poll syscall does the same with a pointer to struct pollfd. Furthermore, another thread or program could very well have modified the memory of the program while it was making the syscall, thereby altering the data.
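    The poll case is easy to reproduce from user space. The sketch below uses a hypothetical helper, `poll_rewrites_revents`, that polls the read end of a pipe that already has data in it and compares the `revents` field of struct pollfd before and after the call.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only).
 * Returns 1 if poll(2) rewrote pfd.revents from 0 to a value containing
 * POLLIN, 0 if it did not, -1 on error.
 */
int poll_rewrites_revents(void)
{
    int fds[2];

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    struct pollfd pfd = { .fd = fds[0], .events = POLLIN, .revents = 0 };

    /* What a tracer would read through the pointer at syscall entry. */
    short revents_at_entry = pfd.revents;

    int ready = poll(&pfd, 1, 0);

    /* What a tracer would read through the same pointer at syscall exit. */
    short revents_at_exit = pfd.revents;

    close(fds[0]);
    close(fds[1]);
    if (ready != 1)
        return -1;

    return revents_at_entry == 0 && (revents_at_exit & POLLIN) != 0;
}
```

    The pointer passed to poll is identical at entry and exit, yet the structure behind it differs, so a tracer that reads it only at exit sees the output, not the input.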

    Under specific circumstances a syscall can also take a very long time before returning (think for example of a sleep, or a blocking read on your terminal, an accept on a listening socket, etc). If you only trace at the exit, you will have very incorrect timing information, and most importantly you will have to wait a lot before any meaningful information can be captured, even though that information is already available at the entry point.
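    The gap between entry and exit can be measured with a small sketch. The helper `blocking_call_elapsed_ms` is hypothetical; it timestamps before and after a nanosleep, which stands in for any blocking syscall.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/*
 * Hypothetical helper (illustration only).
 * Returns the number of whole milliseconds between the moment just before
 * nanosleep(2) is entered and the moment just after it returns,
 * or -1 on error (including an interrupted sleep).
 */
long blocking_call_elapsed_ms(long sleep_ms)
{
    struct timespec req = {
        .tv_sec = sleep_ms / 1000,
        .tv_nsec = (sleep_ms % 1000) * 1000000L
    };
    struct timespec entry_ts, exit_ts;

    if (clock_gettime(CLOCK_MONOTONIC, &entry_ts) != 0)
        return -1;
    if (nanosleep(&req, NULL) != 0)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &exit_ts) != 0)
        return -1;

    /* Compute the difference in nanoseconds first, then truncate to ms. */
    long long elapsed_ns =
        (long long)(exit_ts.tv_sec - entry_ts.tv_sec) * 1000000000LL +
        (long long)(exit_ts.tv_nsec - entry_ts.tv_nsec);

    return (long)(elapsed_ns / 1000000LL);
}
```

    An event emitted only at exit would be stamped with the second timestamp, so everything known at entry would be reported at least `sleep_ms` late.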


    Note on sys_exit tracepoint

    Although you could technically extract the values of the saved registers of the current task, I am not entirely sure about the semantics of doing so while in the sys_exit tracepoint. I searched for documentation on this specific case but had no luck, and the kernel code is, well... complex.

    The chain of calls to reach the exit hook should be:

    If a fatal signal is delivered to a process during a syscall, the process never reaches the exit of the syscall (i.e. no value is ever returned to user space), but the tracepoint is still hit. When a signal delivery of this kind happens, a special internal return value is used, such as -ERESTARTSYS. This is not an actual syscall return value (it is not returned to user space); it is only meant to be used by the kernel. So it looks like the sys_exit tracepoint is hit with the special -ERESTARTSYS value if a fatal signal is received by the process. This does not happen, for example, in the case of SIGSTOP + SIGCONT. Take this with a grain of salt though, since I was not able to find proper documentation for it.
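    If a tracer wants to flag such values, a check along these lines could work. This is only a sketch: the `KERN_`-prefixed macro names and the helper are made up, and the numeric values are copied from the kernel-internal `include/linux/errno.h` as I remember them (they are not exposed in the user-space `<errno.h>`), so verify them against your kernel source.

```c
/*
 * Kernel-internal restart codes. Assumption: values taken from the
 * kernel's include/linux/errno.h; the KERN_ prefix is made up here
 * because user-space headers do not define these constants.
 */
#define KERN_ERESTARTSYS           512
#define KERN_ERESTARTNOINTR        513
#define KERN_ERESTARTNOHAND        514
#define KERN_ERESTART_RESTARTBLOCK 516

/*
 * Hypothetical helper: returns 1 if a return value seen at sys_exit is one
 * of the kernel-internal restart codes (which user space never sees),
 * 0 otherwise.
 */
int is_kernel_restart_code(long ret)
{
    return ret == -KERN_ERESTARTSYS ||
           ret == -KERN_ERESTARTNOINTR ||
           ret == -KERN_ERESTARTNOHAND ||
           ret == -KERN_ERESTART_RESTARTBLOCK;
}
```

    With this, a value of -512 observed at sys_exit would be classified as internal, while an ordinary error such as -EINTR (-4) would not.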