c linux system-calls cgroups

Are write failures to cgroup tasks deterministically non-persistent?


Consider the following program.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

void
setup() {
    system("mkdir /sys/fs/cgroup/cpuset/TestingCpuset");
    system("echo 0,1 > /sys/fs/cgroup/cpuset/TestingCpuset/cpuset.cpus");
    system("echo 0 > /sys/fs/cgroup/cpuset/TestingCpuset/cpuset.mems");
}

int
main() {
    setup();
    // Picked to be the pid of an ordinary thread or process on the currently
    // running system.
    const char* validPid = "30100";
    const char* invalidPid = "2";
    const char* taskPath = "/sys/fs/cgroup/cpuset/TestingCpuset/tasks";
    int fd = open(taskPath, O_WRONLY);
    if (fd < 0) {
        fprintf(stderr, "Failed to open %s; errno %d: %s\n", taskPath, errno,
                strerror(errno));
        return 1;
    }
    int retVal = write(fd, invalidPid, strlen(invalidPid));
    if (retVal < 0) {
        fprintf(stderr, "Invalid write of %s to fd %d; errno %d: %s\n",
                invalidPid, fd, errno, strerror(errno));
    }

    retVal = write(fd, validPid, strlen(validPid));
    if (retVal < 0) {
        fprintf(stderr, "Invalid write of %s to fd %d; errno %d: %s\n",
                validPid, fd, errno, strerror(errno));
    }
}

The output of this program (run under sudo) is:

Invalid write of 2 to fd 3; errno 22: Invalid argument

Note that the subsequent write does not fail; the first write failure did not induce a failure of the next write.

Is this lack of failure persistence deterministic and reliable?

I've looked at the write man page, but it doesn't say anything about persistence of failures.


Solution

  • There is no error state generally associated with file descriptors in Linux. See below for links.

    But, before we go any further, please do not check for errors using < 0. If an error occurred (for open() or write()), the return value is -1. If a write() succeeded, it returns the number of bytes written. Even for sysfs writes, you really should check that. There has been exactly one filesystem/kernel bug (or bug "family") where read()/write() returned a negative value other than -1; the value did not indicate an error, but was a success count that had wrapped around for very large writes to ordinary files. Because of that, the kernel now caps all reads/writes to slightly less than 2 GiB. If everyone had checked errors using < 0, we would not have caught that at all.

    It is better to be slightly paranoid and catch unexpected errors, rather than assume and potentially silently lose data, in my opinion.
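
    As a minimal sketch of the stricter check described above: the helper name write_exact() and the choice to report EIO for a short or otherwise unexpected result are assumptions made for illustration, not part of any library.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: issues one write() call for len bytes and classifies
 * the result. Returns 0 if exactly len bytes were written, otherwise an
 * errno value: the reported errno when write() returned -1, or EIO for a
 * short write or any other unexpected return value. */
int write_exact(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);

    if (n == -1)
        return errno;              /* the documented error return */
    if (n < 0 || (size_t)n != len)
        return EIO;                /* short write or impossible value */
    return 0;
}
```

    With this, the check in the question would become something like if (write_exact(fd, validPid, strlen(validPid)) != 0), which also notices a short write instead of silently accepting it.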


    Is this lack of failure persistence deterministic and reliable?

    For kernel pseudofiles under /sys/, the answer is yes: each write is considered a separate operation. Previous reads from, or writes to, the same descriptor do not affect the outcome of the current write.

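    Writes to sysfs pseudofiles simply invoke the store() method of the tunable represented by the pseudofile; see fs/sysfs/file.c:sysfs_kf_write() (binary attributes go through sysfs_kf_bin_write(), which calls the attribute's write() method instead). There is no state recorded at all.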

    (We could discuss whether a tunable could record previous assignment attempts and change its behaviour based on that, but let's just say that Linus Torvalds would not knowingly let that kind of thing "fly" at all.)
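
    Because each write is a separate operation, a program that moves several tasks into a cgroup can simply issue one write() per PID and keep going after a rejected one. The following is only a sketch under that assumption; write_pids() and its results array are hypothetical names, and fd is assumed to be an already opened tasks file.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes each PID with its own write() call.
 * results[i] receives 0 on success or the errno of the failed write.
 * A failure does not stop the loop, since it does not affect later writes.
 * Returns the number of PIDs that were accepted. */
size_t write_pids(int fd, const pid_t *pids, size_t count, int *results)
{
    size_t accepted = 0;

    for (size_t i = 0; i < count; i++) {
        char text[32];
        int len = snprintf(text, sizeof text, "%ld", (long)pids[i]);
        ssize_t n = write(fd, text, (size_t)len);

        if (n == (ssize_t)len) {
            results[i] = 0;
            accepted++;
        } else if (n == -1) {
            results[i] = errno;
        } else {
            results[i] = EIO;   /* short write or unexpected return value */
        }
    }
    return accepted;
}
```

    For the program in the question, one would expect results to hold EINVAL for PID 2 and 0 for PID 30100, matching the output shown there.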


    In general, the Linux kernel does not store any error state in the file description. If we look at fs/read_write.c:write() (look for SYSCALL_DEFINE3(write,), we can see that the write() syscall in current kernels invokes ksys_write(), which verifies that the descriptor is valid (returning an EBADF error otherwise) and invokes vfs_write(). (It should be noted that if that succeeds, the file position related to the descriptor is updated using file_pos_write(); the file position is not updated atomically. Therefore, multithreaded concurrent writes to the same file descriptor in Linux should use pwrite() or pwritev() rather than write(), to avoid the race window with respect to the file position update.)
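
    To illustrate the pwrite() suggestion, here is a small sketch in which every write names its own offset, so the shared file position is never involved. The names write_at() and write_two_slots(), the 4-byte slot size, and the file contents are all made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes len bytes at an explicit offset with pwrite(),
 * which does not use or update the descriptor's file position.
 * Returns 0 on success, or an errno value (EIO for a short write). */
int write_at(int fd, const void *buf, size_t len, off_t offset)
{
    ssize_t n = pwrite(fd, buf, len, offset);

    if (n == -1)
        return errno;
    if (n < 0 || (size_t)n != len)
        return EIO;
    return 0;
}

/* Hypothetical usage: creates (or truncates) the file at path and fills two
 * 4-byte slots out of order. Returns 0 on success, or an errno value. */
int write_two_slots(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return errno;

    int err = write_at(fd, "BBBB", 4, 4);   /* second slot first */
    if (err == 0)
        err = write_at(fd, "AAAA", 4, 0);   /* then the first slot */

    if (close(fd) == -1 && err == 0)
        err = errno;
    return err;
}
```

    In a multithreaded program each thread would call write_at() with an offset it owns; the order in which the calls complete then does not change where the data lands.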

    Anyway, vfs_write() does some error checking (EBADF, EINVAL, EFAULT) and bookkeeping, and calls __vfs_write(), which is a wrapper function that calls the appropriate filesystem-specific function, either file->f_op->write() or file->f_op->write_iter().

    (We can also take a look at fs/file_table.c for how the Linux kernel manages its internal file descriptor table (per userspace process), include/linux/fdtable.h:struct fdtable for the descriptor table itself, and at include/linux/fs.h:struct file for the definition of Linux file description. There are no members in any of these structures related to "error state" at all. However, it is useful to note the f_op member in struct file: the member is a pointer to a struct file_operations structure, which contains the per-filesystem handlers for basic file operations related to this particular open file (see include/linux/fs.h:struct file_operations).)

    (Note that in Linux, syscalls return a single integer. For error conditions, this integer contains the negative error number. Zero and positive values are considered success. The C library maintains errno completely in userspace. If you invoke a syscall directly, bypassing the C library (for example with inline assembly), you need to detect error conditions and optionally maintain errno yourself; the C library wrappers, including glibc's generic syscall() function, do this for you. So, when you see a kernel syscall returning say -EINVAL, it means it returns error EINVAL to userspace. The C library is responsible for converting that to -1 with errno == EINVAL.)
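
    The conversion described in the note above can be sketched as follows. The function name convert_syscall_result() is hypothetical; the range check reflects the Linux convention that raw return values from -4095 to -1 are negated error numbers, while anything else is a successful result.

```c
#include <errno.h>

/* Sketch of the conversion a C library wrapper performs on Linux.
 * raw is the value the kernel returned. Values in [-4095, -1] are negated
 * error numbers; every other value is passed through as a success. */
long convert_syscall_result(long raw)
{
    if (raw < 0 && raw >= -4095) {
        errno = (int)-raw;      /* e.g. -EINVAL becomes errno == EINVAL */
        return -1;
    }
    return raw;
}
```

    Nothing in this conversion depends on earlier calls: errno is only a userspace variable that is overwritten when a call fails, not state attached to the descriptor.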

    Again, no error state is recorded in the descriptor, and each operation occurs on its own, not associated with previous operations (other than file position, which itself is not atomically updated as of this writing). Some filesystems could theoretically keep track of the operations and maintain an internal error state associated with the descriptor, but again, unless that is a well-documented feature of the filesystem that other implementations also honor, it is unlikely Linux kernel developers would actually allow such a thing.
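
    The same behaviour can be observed without a cgroup hierarchy. The sketch below is a stand-in, not the setup from the question: it forces a failing write() by passing a buffer in an inaccessible (PROT_NONE) page, which is expected to fail with EFAULT on Linux, and then issues a valid write() on the same descriptor. The function name write_after_failure() is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Performs a write() that is expected to fail, then a valid write() on the
 * same descriptor. Stores the errno of the first write in *first_errno
 * (0 if it unexpectedly succeeded) and returns the second write's result.
 * Returns -1 with *first_errno set if the test page cannot be mapped. */
ssize_t write_after_failure(int fd, int *first_errno)
{
    /* An inaccessible page: the kernel cannot copy data out of it. */
    void *bad = mmap(NULL, 4096, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (bad == MAP_FAILED) {
        *first_errno = errno;
        return -1;
    }

    *first_errno = (write(fd, bad, 1) == -1) ? errno : 0;
    munmap(bad, 4096);

    /* The earlier failure left no state behind on the descriptor. */
    return write(fd, "ok", 2);
}
```

    Called on the write end of a pipe, the second write is expected to return 2 even though the first one failed, which mirrors what the question observed with the tasks file.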


    It is important to realize that there are two key principles the Linux kernel developers must follow (because Linus enforces it): public kernel interfaces (syscalls, /proc and /sys pseudofiles) are stable and compatible across kernel versions (see this LKML message); and sane practice trumps theory, even if mandated by some standard. See for example Torvalds' Wikiquotes, or his posts on the Linux Kernel mailing list (marc.info mirror; lkml.org here).

    The reason I trust his opinion is, as he himself has said, "because they know they don't have to". I (try to) do that myself, and that is why this answer contains hopefully sufficient references so that you can verify for yourself.