Tags: c, linux, fork, system-calls, glibc

How to implement another variation of the clone(2) syscall in the Linux kernel?


I'm trying to create another version of the clone(2) syscall (in kernel space) that clones a user process with some additional parameters. This system call will do exactly the same job as clone(2), but I want to pass one additional parameter to the kernel from user space. However, when I look at glibc's code, it seems that the parameters are not all passed to the kernel in the same order as in the user's call to clone():

int clone(int (*fn)(void *), void *child_stack,
             int flags, void *arg, ...
             /* pid_t *ptid, void *newtls, pid_t *ctid */ );

Instead, some of them are handled by glibc's code itself. I searched the internet to learn how glibc's clone() works but couldn't find any good documentation. Can anyone please explain:

  1. How does glibc handle clone()?
  2. The parameters of the syscall in the kernel are not exactly the same as those of clone() in glibc, so how are these differences handled?

Solution

  • How does glibc handle clone()?

    Via arch-specific assembly wrappers. For i386, see sysdeps/unix/sysv/linux/i386/clone.S in the glibc sources; for x86-64, see sysdeps/unix/sysv/linux/x86_64/clone.S, and so on.

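    The normal syscall wrappers are not sufficient, because it is up to userspace to switch stacks: the child returns from the syscall on a fresh stack, so the wrapper has to arrange for fn and arg to be available there and then call fn(arg) in the child. The above assembly files have pretty informative comments as to what actually needs to be done in userspace in addition to the syscall.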
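
    To make the userspace side concrete, here is a minimal sketch of calling the glibc clone() wrapper. The helper name run_in_clone, the 64 KiB stack size, and the choice of SIGCHLD as the only flag are assumptions made for illustration; they are not taken from the glibc sources.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>      /* clone() */
#include <signal.h>     /* SIGCHLD */
#include <stdlib.h>     /* malloc(), free() */
#include <sys/types.h>  /* pid_t */
#include <sys/wait.h>   /* waitpid(), WIFEXITED(), WEXITSTATUS() */

#define CHILD_STACK_SIZE (64 * 1024)  /* arbitrary size chosen for this sketch */

/* Runs in the child, on the stack handed to clone(). Its return value
   becomes the child's exit status. */
static int child_fn(void *arg)
{
    return *(int *)arg;
}

/* Hypothetical helper: run child_fn in a cloned child and return the
   child's exit status, or -1 on failure. */
int run_in_clone(int value)
{
    assert(value >= 0 && value <= 255);  /* exit statuses are 8 bits wide */

    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward on x86 and most other architectures,
       so clone() is given the top of the buffer. Without CLONE_VM the
       child gets its own copy of memory, so &value stays valid there. */
    pid_t pid = clone(child_fn, stack + CHILD_STACK_SIZE, SIGCHLD, &value);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

    Note that fn and arg appear only in the library call: the kernel never sees them, which is why the library signature and the kernel signature differ.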


  • The parameters of the syscall in the kernel are not exactly the same as those of clone() in glibc, so how are these differences handled?

    C library functions that map to a syscall are wrapper functions.

    Consider, for example, the POSIX.1 write() C library low-level I/O function, and the Linux write() syscall. The parameters are basically the same, as are the error conditions, but the error return values differ. The C library function returns -1 with errno set if an error occurs, whereas the Linux syscall returns negative error codes (which basically match errno values).

    If you look at e.g. sysdeps/unix/sysv/linux/x86_64/sysdep.h, you can see that the basic syscall wrapper for Linux on x86-64 boils down to

    # define INLINE_SYSCALL(name, nr, args...)                                    \
      ({                                                                          \
        unsigned long int resultvar = INTERNAL_SYSCALL (name, , nr, args);        \
        if (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (resultvar, )))            \
          {                                                                       \
            __set_errno (INTERNAL_SYSCALL_ERRNO (resultvar, ));                   \
            resultvar = (unsigned long int) -1;                                   \
          }                                                                       \
        (long int) resultvar; })
    

    which just calls the actual syscall, then checks whether the syscall return value indicates an error; if it does, it changes the result to -1 and sets errno accordingly. It only looks odd because it relies on GCC extensions (a statement expression, ({ ... }), and named variadic macro arguments, args...) so that the whole macro can be used as a single expression.
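
    The same conversion can be sketched as an ordinary C function. The names raw_to_libc, RAW_MAX_ERRNO and demo_error_convention are made up for this illustration; the range check assumes the usual Linux convention that raw error returns lie between -4095 and -1.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>   /* write() */

/* Assumed convention: a raw syscall return value in [-4095, -1] is an error. */
#define RAW_MAX_ERRNO 4095

/* Hypothetical function form of the macro's logic: turn a raw kernel
   return value into the C library convention (-1 with errno set). */
long raw_to_libc(long raw)
{
    if ((unsigned long)raw >= (unsigned long)-RAW_MAX_ERRNO) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}

/* Usage: the libc wrapper and the converted raw value report the same way. */
void demo_error_convention(void)
{
    /* The libc write() on an invalid descriptor fails with EBADF. */
    errno = 0;
    assert(write(-1, "x", 1) == -1 && errno == EBADF);

    /* The kernel would have returned -EBADF for that call. */
    errno = 0;
    assert(raw_to_libc(-EBADF) == -1 && errno == EBADF);

    /* Non-error values pass through unchanged. */
    assert(raw_to_libc(3) == 3);
}
```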


    Let's say you added a new syscall to Linux, say

    SYSCALL_DEFINE2(splork, unsigned long, arg1, void *, arg2);
    

    and, for whatever reasons, you wish to expose it to userspace as

    int splork(void *arg2, unsigned long arg1);
    

    No problem! All you need is to provide a minimal header file,

    #ifndef _SPLORK_H
    #define _SPLORK_H
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE     /* only effective before the first libc include */
    #endif
    #include <unistd.h>     /* declares syscall() */
    #include <sys/syscall.h>
    #include <errno.h>
    
    #ifndef __NR_splork
    #if defined(__x86_64__)
    #define __NR_splork /* syscall number on x86-64 */
    #elif defined(__i386)
    #define __NR_splork /* syscall number on i386 */
    #endif
    #endif /* !__NR_splork */
    
    #ifdef __NR_splork
    #ifndef SYS_splork
    #define SYS_splork __NR_splork
    #endif
    
    static inline int splork(void *arg2, unsigned long arg1)
    {
        /* The kernel expects (arg1, arg2); this wrapper only swaps the order.
           glibc's syscall() already turns a negative kernel error code into
           a -1 return value with errno set, so no extra handling is needed. */
        return (int)syscall(SYS_splork, arg1, arg2);
    }
    
    #else
    #undef SYS_splork
    
    static inline int splork(void *arg2, unsigned long arg1)
    {
        (void)arg2;
        (void)arg1;
        /* Note: For backward compatibility, we might wish to use
                     *(__errno_location()) = ENOTSUP;
                 here. */
        errno = ENOTSUP;
        return -1;
    }
    
    #endif
    
    #endif /* _SPLORK_H */
    

    The SYS_splork and __NR_splork preprocessor macros define the syscall number for the new syscall. Since the syscall number is likely not (yet?) included in the official kernel sources and headers, the above header file defines it explicitly for each supported architecture. On architectures where it is not supported, the splork() function always returns -1 with errno == ENOTSUP.

    Note, however, that Linux syscalls are limited to 6 parameters. If your kernel function needs more, you need to pack the parameters into a structure, pass the address of that structure to the kernel, and use copy_from_user() to copy the values into a kernel-side instance of the same structure.

    On all Linux architectures, pointers and long are the same size (int may be smaller than a pointer), so I recommend you use either long or fixed-size types in such structures to pass data to/from the kernel.
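
    As a sketch of the userspace half of that approach, the structure below uses only fixed-size fields. The structure name, its fields, and the helper splork_args_init are all hypothetical and exist only to illustrate the layout advice above.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uint32_t, uint64_t, uintptr_t */
#include <string.h>   /* memset() */

/* Hypothetical argument block for a syscall that needs more than six
   values. Every field has a fixed size, so 32-bit and 64-bit callers
   produce the same layout. Pointers travel as 64-bit integers. */
struct splork_args {
    uint64_t flags;
    uint64_t buf;       /* user pointer, widened to 64 bits */
    uint64_t buf_len;
    uint32_t mode;
    uint32_t reserved;  /* explicit padding, kept at zero */
};

/* Catch accidental padding at compile time. */
static_assert(sizeof(struct splork_args) == 32, "unexpected padding");

/* Hypothetical helper: fill the block, zeroing everything first so that
   reserved fields never carry stack garbage into the kernel. */
void splork_args_init(struct splork_args *a, uint64_t flags,
                      const void *buf, size_t len, uint32_t mode)
{
    memset(a, 0, sizeof *a);
    a->flags = flags;
    a->buf = (uint64_t)(uintptr_t)buf;
    a->buf_len = len;
    a->mode = mode;
}
```

    The address of such a block would then be passed as a single syscall argument, and the kernel side would copy it in with copy_from_user() before reading any field.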