c linux-kernel system-calls userspace

How does the Linux kernel "listen" to the C library?


I'm trying to build up a "big picture" of how things work in the Linux kernel and userspace, and I'm quite confused. I know that userspace makes use of system calls to "talk" to the kernel, but I don't know how. I tried to read the C library and kernel source code, but they are complex and not easy to understand. I've also read several books covering operating system concepts, like managing processes, memory and devices, but they don't make the "transition" (userspace->kernel) clear. So, where exactly does the transition between userspace and kernel space happen? How does the C library run code that's inside the Linux kernel running on the machine?

To make an analogy: imagine that there is a house. The house is locked. The key to open the house is inside the house itself. There's only one person inside the house, the kernel. The userspace is someone trying to enter the house. My question would be: how does the kernel know there's someone outside the house wanting the key, and which mechanism allows the house to be opened with that key?


Solution

  • That's quite easy - the person can ring the doorbell to let the kernel know they are waiting outside. The doorbell in our case is usually a special CPU exception, a software interrupt or a dedicated instruction that a user-space application is allowed to use and that the kernel can handle.

    So the procedure is like this:

    1. First you need to know the system call number. Each syscall has a unique number, and there is a table inside the kernel that maps those numbers to specific functions. The numbering is per architecture, so the same number may map to different syscalls on two different architectures. For example, write is number 1 on x86_64 but number 4 on 32-bit x86.

    2. Then you set up your arguments. This is also architecture-specific, but it is not much different from passing arguments in ordinary function calls. Usually you put your arguments in specific CPU registers, as described in the ABI of the architecture.

    3. Then you enter the kernel. Depending on the architecture, this may mean causing an exception or executing a dedicated CPU instruction.

    4. The kernel has a special handler function that runs in kernel mode when a syscall is made. It saves the user-space state of the calling process (its registers), reads the syscall number and arguments, and calls the matching syscall routine. When the routine is done, the handler puts the return value where user space can read it, restores the saved state and returns to the process. Strictly speaking this is a mode switch rather than a full context switch: the kernel only switches to another process if the syscall blocks or the scheduler decides to run something else.
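
    The table lookup from steps 1 and 4 can be sketched in plain user-space C. This is only a toy model under stated assumptions, not kernel code: every toy_* name, the two "syscalls" and their numbers are made up for illustration, and struct toy_regs merely stands in for the saved user registers.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical syscall numbers for this toy table only. */
enum { TOY_NR_ADD = 0, TOY_NR_NEG = 1, TOY_NR_MAX = 2 };

/* Every toy syscall takes three register-sized arguments. */
typedef long (*toy_syscall_fn)(long, long, long);

static long toy_sys_add(long a, long b, long unused)
{
    (void)unused;
    return a + b;
}

static long toy_sys_neg(long a, long unused1, long unused2)
{
    (void)unused1;
    (void)unused2;
    return -a;
}

/* The "syscall table": maps a number to the function implementing it. */
static const toy_syscall_fn toy_syscall_table[TOY_NR_MAX] = {
    [TOY_NR_ADD] = toy_sys_add,
    [TOY_NR_NEG] = toy_sys_neg,
};

/* Stand-in for the register state saved on kernel entry. */
struct toy_regs {
    long nr;       /* syscall number chosen by the caller */
    long args[3];  /* arguments placed in "registers" */
    long ret;      /* slot where the result is left for the caller */
};

/* The "handler": validate the number, dispatch, store the result. */
static void toy_syscall_entry(struct toy_regs *regs)
{
    if (regs->nr < 0 || regs->nr >= TOY_NR_MAX ||
        toy_syscall_table[regs->nr] == NULL) {
        regs->ret = -ENOSYS;  /* unknown number: "function not implemented" */
        return;
    }
    regs->ret = toy_syscall_table[regs->nr](regs->args[0],
                                            regs->args[1],
                                            regs->args[2]);
}
```

    Filling a struct toy_regs with nr = TOY_NR_ADD and arguments 2 and 3, then calling toy_syscall_entry, leaves 5 in ret. An out-of-range number leaves -ENOSYS, which mirrors how a real kernel reports an unknown syscall.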

    As an example, to make a syscall on x86_64 you use the syscall instruction with the syscall number in the %rax register. Arguments are passed in registers %rdi, %rsi, %rdx, %r10, %r8 and %r9 (%r10 takes the place of %rcx from the normal function calling convention, because the syscall instruction itself overwrites %rcx), and the return value comes back in %rax.
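
    A minimal sketch of that x86_64 sequence using GCC/Clang inline assembly, assuming Linux. raw_syscall3 and raw_write are hypothetical helper names, not part of libc or the kernel. On other architectures the sketch falls back to libc's syscall() so that it still compiles.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_write, SYS_getpid: per-architecture numbers */
#include <unistd.h>       /* syscall(), used only by the fallback branch */

/* Hypothetical helper: issue a syscall with up to three arguments. */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__linux__) && defined(__x86_64__)
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)               /* result comes back in %rax */
        : "a"(nr),                /* syscall number in %rax */
          "D"(a1),                /* 1st argument in %rdi */
          "S"(a2),                /* 2nd argument in %rsi */
          "d"(a3)                 /* 3rd argument in %rdx */
        : "rcx", "r11", "memory"  /* the instruction clobbers %rcx and %r11 */
    );
    return ret;  /* on failure the kernel returns a negative errno value */
#else
    /* Fallback for other architectures: libc returns -1 and sets errno. */
    return syscall(nr, a1, a2, a3);
#endif
}

/* Hypothetical helper: write(2) without going through libc's write(). */
static long raw_write(int fd, const void *buf, unsigned long len)
{
    return raw_syscall3(SYS_write, fd, (long)buf, (long)len);
}
```

    Calling raw_write(1, "ok\n", 3) writes to standard output and returns the number of bytes written. A bad file descriptor yields a negative result instead of setting errno, because errno is a libc convention layered on top of the raw return value.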

    On 32-bit x86 CPUs the older way is software interrupt 0x80 (the int 0x80 instruction). There the syscall number is specified in the %eax register and the arguments go to %ebx, %ecx, %edx, %esi, %edi and %ebp.

    32-bit ARM (EABI) is very similar - you use the "supervisor call" instruction (svc #0). The syscall number goes in the r7 register, the arguments go in registers r0-r6, and the return value of the syscall is stored in r0.

    Other architectures and operating systems use similar techniques. The details may vary - software interrupt numbers may be different, and arguments may be passed in different registers or even on the stack - but the core idea is the same.
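
    Because the core idea is the same everywhere, the C library can hide the per-architecture details. A minimal sketch, assuming Linux with glibc or a compatible libc: the generic syscall() wrapper takes the number and arguments, runs the right entry instruction for the current architecture, and turns a kernel error into -1 plus errno. The portable_* names are hypothetical helpers for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>  /* SYS_* constants for the architecture being compiled */
#include <unistd.h>       /* syscall() */

/* Hypothetical helper: getpid via the generic wrapper. */
static long portable_getpid(void)
{
    return syscall(SYS_getpid);
}

/* Hypothetical helper: write via the generic wrapper.
 * Returns bytes written, or -1 with errno set on failure. */
static long portable_write(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

    The ordinary libc functions such as getpid() and write() are thin wrappers of the same kind, which is why application code rarely needs to know the syscall numbers or registers at all.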