c linux multithreading pthreads signals

What is a safe way to interrupt a process created by fork()/system() from a different thread than the one which called fork()/system()?


Context:

I am working on a system intended to update a device running Linux over Ethernet. The device hosts an update server; when a client connects, it sends an update package, which the server applies to the device in multiple steps by calling scripts via system(). Meanwhile, a second thread (pthread) sends progress updates to the client and monitors the connection.

I would like the server process to die gracefully if it detects that the client disconnected, as quickly as possible (the intent being if a client disconnects, prevent the update from finalizing and relaunch the update server ASAP to try again).

Problem:

My issue is that if the script is one of the longer-running ones, the server currently takes a while to finish it, during which time the client may try to connect again and fail. (The second thread detects the bad connection and sets an atomic boolean to indicate the problem; after each system() call finishes, the boolean is checked to see whether the process should continue.)

Attempts:

I first tried storing the pthread TID of the main thread in a global, and when a connection fails, the second thread would execute a pthread_kill() on the parent's TID, sending a SIGINT and setting the boolean. From trying this on the terminal, I assumed one of two things would happen: either the process spawned by system() would receive the SIGINT and execution would drop back to the main thread, which would then check the boolean and exit, or, if the signal arrived between system() calls, the server's own SIGINT handler would catch it and exit the same way. From what I can tell, this does not work because system() disables some signals in the parent process, including SIGINT, and because the process created by fork() inside system() has a different PID/TID than the main thread that called it.

This led me to my current attempt, in which I try to recreate the function of system() in a way that the forked child PID is stored and can be interrupted from the second thread instead of the parent's TID, as shown in this MVE:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>
#include <stdatomic.h>

pthread_mutex_t lock;
pid_t forked_process = 0;
atomic_bool done = ATOMIC_VAR_INIT(false);

void *signal_thread(void * args){
    sleep(5);
    pthread_mutex_lock(&lock);
    if (forked_process != 0){
        kill(forked_process, SIGINT);
    }
    pthread_mutex_unlock(&lock);
    done = true;
    return NULL;    /* a void * thread function must return a value */
}

int mysystem(const char * command){
    pid_t pid, res_pid;
    char *argp[] = {"sh", "-c", NULL, NULL};
    argp[2] = (char *)command;
    int status;
    
    switch (pid = fork()) {
    case -1:            /* error */
        return(-1);
    case 0:             /* child */
        execv("/bin/sh", argp);
        _exit(127);
    }
    pthread_mutex_lock(&lock);
    forked_process = pid;
    pthread_mutex_unlock(&lock);

    res_pid = waitpid(pid, &status, 0);
    if (res_pid == -1)
        status = -1;    /* waitpid failed, so status was never filled in */

    pthread_mutex_lock(&lock);
    forked_process = 0;
    pthread_mutex_unlock(&lock);

    return status;
}

int main(int argc, char const *argv[])
{
    pthread_t child;
    int ret = 0;

    if (pthread_mutex_init(&lock, NULL)) {
        printf("Error: mutex init failed\n");
        return -1;
    }

    pthread_create(&child, NULL, &signal_thread, NULL);
    ret = mysystem("./script.sh");

    do {
        printf("Still in main, ret = %d\n", ret);
        sleep(1);
    } while (!done);
    return 0;
}

I understand that this approach likely has some problems of its own. I could see, for instance, a race condition where the forked child finishes but the second thread gets the mutex first, so a signal is sent to a PID that is no longer valid or that belongs to a different process (although I think it would be very unlikely for both the race condition AND PID reuse to occur in a way that this could happen; correct me if I am wrong). What I do not understand is why this example in its current form does not work. The correct PID (as reported by both the script and the thread calling kill()) is signaled, but the script continues to completion. This does not happen if I send SIGKILL; however, that is undesirable, as I would like the scripts to be able to handle the interrupt and clean up after themselves.
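A minimal sketch of one way the race mentioned above might be closed, assuming a Linux/POSIX system that provides waitid() with WNOWAIT. The idea is to wait for the child without reaping it, clear the stored PID while the child is still a zombie (so the kernel cannot reuse the PID yet), and only then collect the exit status. The names child_lock, child_pid, interrupt_child() and run_command() are hypothetical and not part of the MVE.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t child_lock = PTHREAD_MUTEX_INITIALIZER;
static pid_t child_pid = 0;              /* 0 means "no child running" */

/* Monitoring thread: signal the child if one is published.
   Returns 1 if a signal was sent, 0 otherwise. */
int interrupt_child(int signo)
{
    int sent = 0;

    pthread_mutex_lock(&child_lock);
    if (child_pid != 0 && kill(child_pid, signo) == 0)
        sent = 1;
    pthread_mutex_unlock(&child_lock);
    return sent;
}

/* Main thread: run `command` via /bin/sh; returns the wait status or -1. */
int run_command(const char *command)
{
    siginfo_t info;
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {                      /* child */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);
    }

    pthread_mutex_lock(&child_lock);
    child_pid = pid;
    pthread_mutex_unlock(&child_lock);

    /* Wait for termination but leave the child as a zombie (WNOWAIT),
       so its PID cannot be reused yet. Retry if interrupted. */
    while (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) == -1
           && errno == EINTR)
        ;

    /* Unpublish the PID while it still refers to our zombie child. */
    pthread_mutex_lock(&child_lock);
    child_pid = 0;
    pthread_mutex_unlock(&child_lock);

    /* Only now reap the child and collect its status. */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return status;
}
```

With this ordering, any kill() issued under the lock targets either the running child or its zombie, never an unrelated process. It does not by itself change whether the script reacts to SIGINT.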

So to reiterate my main question, what methods should I be looking into to accomplish the desired functionality of:

  • Main thread executes scripts in a way that can be interrupted
  • Secondary thread can interrupt the script executed in main OR main itself

I would be happy to be pointed in another direction for accomplishing this, as it seems like it is not something I can easily accomplish with my current methods, or if someone could point out a way to get my example working as expected in a way that is safe, I would be equally happy.

For completeness, the script this is executing:

#!/bin/bash

for i in {1..20}
do
   echo "Sleeping in script PID = $$"
   sleep 1
done
echo "DONE"

And it is compiled with:

gcc -pthread ./main.c -o main

Solution

  • I first tried storing the pthread TID of the main thread in a global, and when a connection fails, the second thread would execute a pthread_kill() on the parent's TID, sending a SIGINT [...]

    The main problem with that is not SIGINT being blocked in the process calling system(), but rather that it is the child process that needs to be signaled in that case. Raising a signal in the thread running system() does not achieve that.

    You could try kill(0, SIGINT). That will send a SIGINT to every process in the calling process's process group, which probably includes the shell launched by system() (but not necessarily processes launched in turn by that shell). It definitely includes the calling process itself, though system() causes SIGINT to be ignored there while the command is running. Take care, because it may include other processes as well, such as the parent of the update server.

    The combination of system("./script.sh") and kill(0, SIGINT) seems to work for me in my limited testing, but I do recommend taking care to ensure that the resulting SIGINT will not kill the server process. It did not do so in my tests, most likely because system() ignores SIGINT in the calling process until the command finishes; a kill(0, SIGINT) issued between system() calls gets no such protection.

    Along the same lines, you could try kill(-1, SIGINT). That will send the signal to all the processes (excepting a few unspecified system processes) that the calling process is permitted to signal. That will almost surely include all the processes related to the system() call. It will include the calling process itself. And it may very well include other processes as well. When I tested this, it killed my whole login session.
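    If the reach of kill(0, SIGINT) or kill(-1, SIGINT) is a concern, one possible way to narrow the target is to put the script in a process group of its own and signal only that group. The following is a minimal sketch under that assumption, not something I tested against the update server; spawn_in_own_group(), signal_group() and wait_for_child() are hypothetical helper names. Processes the shell starts stay in the same group unless they change it themselves, so they receive the signal too, while the server and its parent do not.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start `command` via /bin/sh as the leader of a new process group.
   Returns the child's PID (which is also the group ID), or -1. */
pid_t spawn_in_own_group(const char *command)
{
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {                      /* child */
        setpgid(0, 0);                   /* new group, ID == own PID */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);
    }
    /* Set the group from the parent as well, so it exists no matter
       which side runs first. A failure after the child's exec is
       harmless because the child has already done the same call. */
    setpgid(pid, pid);
    return pid;
}

/* Send `signo` to every process in the group `pgid`. */
int signal_group(pid_t pgid, int signo)
{
    return kill(-pgid, signo);           /* negative PID means "group" */
}

/* Reap the child, retrying on EINTR; returns the wait status or -1. */
int wait_for_child(pid_t pid)
{
    int status;

    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return status;
}
```

    The second thread would then call signal_group(pid, SIGINT) with the PID returned by spawn_in_own_group(), while the main thread sits in wait_for_child(pid).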

    I try to recreate the function of system() in a way that the forked child PID is stored and can be interrupted from the second thread instead of the parent's TID

    [...]

    What I do not understand is why this example in its current form does not work.

    You seem to be running into a behavior of your /bin/sh. I find that if I modify your mysystem() function to launch the script via /bin/bash ...

            execl("/bin/bash", "bash", "-c", command, (char *) NULL);
    

    ... then it receives the SIGINT sent by the other thread and aborts. I'm unable to explain that based on the manual. I'm especially having trouble explaining why bash still ignores SIGINT when I execute the same command (/bin/bash -c ./script.sh) from an interactive shell.