
High availability computing: How to deal with a non-returning system call, without risking false positives?


I have a process that's running on a Linux computer as part of a high-availability system. The process has a main thread that receives requests from the other computers on the network and responds to them. There is also a heartbeat thread that sends out multicast heartbeat packets periodically, to let the other processes on the network know that this process is still alive and available. If they don't hear any heartbeat packets from it for a while, one of them will assume this process has died and will take over its duties, so that the system as a whole can continue to work.

This all works pretty well, but the other day the entire system failed, and when I investigated why I found the following:

  1. Due to (what is apparently) a bug in the box's Linux kernel, there was a kernel "oops" induced by a system call that this process's main thread made.
  2. Because of the kernel "oops", the system call never returned, leaving the process's main thread permanently hung.
  3. The heartbeat thread, OTOH, continued to operate correctly, which meant that the other nodes on the network never realized that this node had failed, and none of them stepped in to take over its duties... and so the requested tasks were not performed and the system's operation effectively halted.

My question is, is there an elegant solution that can handle this sort of failure? (Obviously one thing to do is fix the Linux kernel so it doesn't "oops", but given the complexity of the Linux kernel, it would be nice if my software could handle other future kernel bugs more gracefully as well.)

One solution I don't like would be to put the heartbeat generator into the main thread, rather than running it as a separate thread, or in some other way tie it to the main thread so that if the main thread gets hung up indefinitely, heartbeats won't get sent. The reason I don't like this solution is that the main thread is not a real-time thread, so doing this would introduce the possibility of occasional false positives, where a slow-to-complete operation is mistaken for a node failure. I'd like to avoid false positives if I can.

Ideally there would be some way to ensure that a failed syscall either returns an error code, or if that's not possible, crashes my process; either of those would halt the generation of heartbeat packets and allow a failover to proceed. Is there any way to do that, or does an unreliable kernel doom my user process to unreliability as well?


Solution

  • My second suggestion is to use ptrace to find the current instruction pointer. You can have a parent thread that ptraces your process and interrupts it every second to check the current RIP value. This is somewhat complex, so I've written a demonstration program (x86_64 only, but that should be fixable by changing the register names):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sched.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>
    #include <sys/types.h>
    #include <linux/ptrace.h>
    #include <sys/user.h>
    #include <time.h>
    
    // this number is arbitrary - find a better one.
    #define STACK_SIZE (1024 * 1024)
    
    int main_thread(void *ptr) {
        // "main" thread is now running under the monitor
        printf("Hello from main!\n");
        while (1) {
            int c = getchar();
            if (c == EOF) { break; }
            nanosleep(&(struct timespec) {0, 200 * 1000 * 1000}, NULL);
            putchar(c);
        }
        return 0;
    }
    
    int main(int argc, char *argv[]) {
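        // char * rather than void * so the pointer arithmetic below is standard C
        char *stack = malloc(STACK_SIZE);
        if (stack == NULL) {
            perror("failed to allocate child stack");
            return 3;
        }
        pid_t v;
        // the stack grows downward on x86_64, so pass the top of the allocation
        if (clone(main_thread, stack + STACK_SIZE, CLONE_PARENT_SETTID | CLONE_FILES | CLONE_FS | CLONE_IO, NULL, &v) == -1) { // you'll want to check these flags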
            perror("failed to spawn child task");
            return 3;
        }
        printf("Target: %d; %d\n", v, getpid());
        long ptv = ptrace(PTRACE_SEIZE, v, NULL, NULL);
        if (ptv == -1) {
            perror("failed monitor seize");
            exit(1);
        }
        struct user_regs_struct regs;
        fprintf(stderr, "beginning monitor...\n");
        while (1) {
            sleep(1);
            long ptv = ptrace(PTRACE_INTERRUPT, v, NULL, NULL);
            if (ptv == -1) {
                perror("failed to interrupt main thread");
                break;
            }
            int status;
            if (waitpid(v, &status, __WCLONE) == -1) {
                perror("target wait failed");
                break;
            }
            if (!WIFSTOPPED(status)) { // this section is messy. do it better.
                fputs("target wait went wrong\n", stderr);
                break;
            }
            // PTRACE_INTERRUPT should produce a PTRACE_EVENT_STOP reported with SIGTRAP
            if ((status >> 8) != (SIGTRAP | PTRACE_EVENT_STOP << 8)) {
                fputs("target wait went wrong (2)\n", stderr);
                break;
            }
            ptv = ptrace(PTRACE_GETREGS, v, NULL, &regs);
            if (ptv == -1) {
                perror("failed to peek at registers of thread");
                break;
            }
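            // time_t and the 64-bit registers need wider format specifiers than %d / %x
            fprintf(stderr, "%ld -> RIP %llx RSP %llx\n", (long)time(NULL), (unsigned long long)regs.rip, (unsigned long long)regs.rsp);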
            ptv = ptrace(PTRACE_CONT, v, NULL, NULL);
            if (ptv == -1) {
                perror("failed to resume main thread");
                break;
            }
        }
        return 2;
    }
    

    Note that this is not production-quality code. You'll need to do a bunch of fixing things up.

    Based on this, you should be able to figure out whether or not the program counter is advancing, and could combine this with other pieces of information (such as /proc/PID/status) to find out whether it's busy in a system call. You might also be able to extend the usage of ptrace to check which system call is in progress, so that you can decide whether it's a reasonable one to be waiting on.
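
    As a rough sketch of that idea (not part of the demonstration program above, and untested against a real hang), the monitor could feed each RIP/RSP sample into a small tracker and read the task's state letter from /proc/PID/status. The names stall_watch, stall_watch_update, and task_state are hypothetical helpers invented for this illustration. Note that unchanged registers alone are ambiguous: a thread legitimately blocked in read() looks the same, which is why the state letter (or the syscall in progress) has to be considered too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical bookkeeping for one monitored task. */
struct stall_watch {
    unsigned long long last_rip;
    unsigned long long last_rsp;
    unsigned same_count; /* consecutive samples with identical registers */
};

/* Record one register sample. Returns true once RIP and RSP have been
 * identical for `limit` consecutive samples after the first one seen. */
static bool stall_watch_update(struct stall_watch *w,
                               unsigned long long rip,
                               unsigned long long rsp,
                               unsigned limit)
{
    assert(w != NULL);
    if (rip == w->last_rip && rsp == w->last_rsp) {
        w->same_count++;
    } else {
        w->last_rip = rip;
        w->last_rsp = rsp;
        w->same_count = 0;
    }
    return w->same_count >= limit;
}

/* Read the one-letter state (R, S, D, ...) from the "State:" line of
 * /proc/<pid>/status. Returns '?' if the file cannot be read. */
static char task_state(pid_t pid)
{
    char path[64];
    char line[256];
    char state = '?';

    snprintf(path, sizeof path, "/proc/%d/status", (int)pid);
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return state;
    }
    while (fgets(line, sizeof line, f) != NULL) {
        /* The space in the format also matches the tab after "State:". */
        if (sscanf(line, "State: %c", &state) == 1) {
            break;
        }
    }
    fclose(f);
    return state;
}
```

    In the monitor loop, stall_watch_update would be called with regs.rip and regs.rsp after each PTRACE_GETREGS. The state is probably best read before issuing PTRACE_INTERRUPT, since a task that is stopped under ptrace reports a tracing-stop state rather than whatever it was doing beforehand. How many identical samples count as a hang is a policy decision that depends on how long your legitimate system calls can block.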

    This is a hacky solution, but I don't think you'll find a non-hacky solution to this problem. Despite the hackiness, I don't think (this is untested) that it would be particularly slow: my implementation pauses the monitored thread once per second for a very short time, which I would guess is in the hundreds of microseconds. At 100 microseconds per second, that's around a 0.01% efficiency loss, theoretically.
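
    Since that figure is a guess, it may be worth measuring. A minimal sketch, assuming you take one CLOCK_MONOTONIC reading just before PTRACE_INTERRUPT and another just after PTRACE_CONT; elapsed_us and overhead_percent are hypothetical helper names, not part of the program above:

```c
#include <time.h>

/* Microseconds elapsed between two CLOCK_MONOTONIC readings. */
static double elapsed_us(struct timespec start, struct timespec end)
{
    return (end.tv_sec - start.tv_sec) * 1e6 +
           (end.tv_nsec - start.tv_nsec) / 1e3;
}

/* Share of each sampling interval that the target spends stopped,
 * expressed as a percentage. */
static double overhead_percent(double pause_us, double interval_us)
{
    return 100.0 * pause_us / interval_us;
}
```

    The interval measured this way is taken from the monitor's side, so it includes the monitor's own waitpid and PTRACE_GETREGS calls and should be read as an upper bound on how long the target was actually stopped.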