c linux signals fork

What's wrong with sigwait while waiting for signals from multiple processes?


I have multiple processes, in this case three: one parent and two children. I expect each child to continue after returning from its signal handler. But sometimes they all get stuck, and sometimes only one of them continues. What is my mistake? I'm essentially trying to wake the children up with a signal. I could use more signals for that, but what if 100 children are needed? So I want to achieve this using only SIGUSR2.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <sys/wait.h>   /* declares waitpid(), used by the parent below */


void tellerHandler(int sig) {
    //write(STDERR_FILENO, "Teller has caught SIGUSR2 signal\n", 33);
    printf("pid %u Teller has caught SIGUSR2 signal\n", getpid());
}

int main() {

    int NUM_OF_CHILDREN = 2;
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_flags = 0;
    sa.sa_handler = tellerHandler;
    if (sigaction(SIGUSR2, &sa, NULL) == -1) {
        perror("sigaction error");
        exit(-1);
    }

    sigset_t new_mask;
    sigfillset(&new_mask);
    sigdelset(&new_mask, SIGUSR2);

    int returnedPid = -1;
    pid_t pidList[NUM_OF_CHILDREN];

    for (int i = 1; i < 1 + NUM_OF_CHILDREN; ++i) {
        if ((returnedPid = fork()) == 0) {
            break;
        } else {
            pidList[i - 1] = returnedPid;
        }
    }

    if (returnedPid == 0) {
        sigsuspend(&new_mask);
        printf("child %u returned from handler\n", getpid());
    } else {
        for (int i = 0; i < NUM_OF_CHILDREN; ++i) {
            printf("child %u\n", pidList[i]);
            kill(pidList[i], SIGUSR2);
        }

        for (int i = 0; i < NUM_OF_CHILDREN; ++i) {
            waitpid(pidList[i], 0, 0);
        }
        puts("parent exiting...\n");
    }

    puts("donee");
}

Some different outputs from several runs:

child 66313
child 66314
pid 66313 Teller has caught SIGUSR2 signal
child 66313 returned from handler
pid 66314 Teller has caught SIGUSR2 signal

---

child 66330
child 66331
pid 66330 Teller has caught SIGUSR2 signal
pid 66331 Teller has caught SIGUSR2 signal

---

When I increase the number of children to 3, the output looks like this:

child 66738
child 66739
child 66740
pid 66739 Teller has caught SIGUSR2 signal
pid 66738 Teller has caught SIGUSR2 signal
child 66738 returned from handler
donee
pid 66740 Teller has caught SIGUSR2 signal

Solution

  • Your basic problem here is a race condition between the parent sending SIGUSR2 and the child calling sigsuspend. If the child is slow to start and the parent runs first, the parent may send the signal BEFORE the child ever calls sigsuspend. Since the child returns from fork() with the signal handler installed and SIGUSR2 unblocked, it may catch the signal (and print the message about catching it) right away, and only then call sigsuspend. By that point the signal has already been handled, so sigsuspend waits for a second signal that never comes.

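    The fix is to ensure that SIGUSR2 is blocked in the child until it calls sigsuspend. A signal sent early then stays pending, and sigsuspend, whose mask leaves SIGUSR2 unblocked, atomically unblocks it and waits, so the pending signal is delivered there. Put the following code BEFORE the loop that calls fork, so that each child inherits the blocked mask: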

    /* Block SIGUSR2; children created by fork() inherit this mask. */
    sigset_t ss;
    sigemptyset(&ss);
    sigaddset(&ss, SIGUSR2);
    sigprocmask(SIG_BLOCK, &ss, NULL);