c pthreads shared-memory multiprocess futex

Shared pthread_cond_broadcast stuck in futex_wait


I have one "server" process a and potentially multiple "client" processes b. The server creates a shared memory object (shm_open) containing a pthread_mutex_t and a pthread_cond_t, which it uses to broadcast to the clients that something has happened (see the minimal example below).

At first this works as expected and supports an arbitrary number of clients. But after the first client gets killed (e.g. using CTRL+C) while waiting for the broadcast, the server sometimes gets stuck in pthread_cond_broadcast, or to be more precise, inside futex_wait according to gdb.

Why? And how should this be done correctly?

After finding some discussions about this, I've tried broadcasting with and without holding the mutex, and with and without using a mutex at all. The behaviour is the same in every case.

The code to reproduce:

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>

struct {
    pthread_cond_t cond;
    pthread_mutex_t mutex;
} *shm;

void a() {
    // create shm and broadcast every second
    int shm_fd = shm_open("/my_shm", O_CREAT | O_RDWR, 0666);
    ftruncate(shm_fd, sizeof(*shm));
    shm = mmap(0, sizeof(*shm), PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    close(shm_fd);

    pthread_mutexattr_t mutexattr;
    pthread_mutexattr_init(&mutexattr);
    pthread_mutexattr_setpshared(&mutexattr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&shm->mutex, &mutexattr);
    pthread_mutex_consistent(&shm->mutex);

    pthread_condattr_t condattr;
    pthread_condattr_init(&condattr);
    pthread_condattr_setpshared(&condattr, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&shm->cond, &condattr);

    for (int i = 0; 1; ++i) {
        pthread_mutex_lock(&shm->mutex);
        pthread_cond_broadcast(&shm->cond);
        pthread_mutex_unlock(&shm->mutex);
        sleep(1);
        printf("broadcast %d\n", i);
    }
}

void b() {
    // open shm and listen for events
    int shm_fd = shm_open("/my_shm", O_RDWR, 0666);
    shm = mmap(0, sizeof(*shm), PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    close(shm_fd);
    for (int i = 0; 1; ++i) {
        pthread_mutex_lock(&shm->mutex);
        pthread_cond_wait(&shm->cond, &shm->mutex);
        pthread_mutex_unlock(&shm->mutex);
        printf("receive %d\n", i);
    }
}

int main(int argc, char** argv) {
    if (argc != 2)
        return -1;
    switch (argv[1][0]) {
    case 'a':
        a();
        break;
    case 'b':
        b();
        break;
    default:
        return -1;
    }
    return 0;
}

Compile with gcc ab.c -o ab -lpthread -lrt, then run

./ab a &
./ab b
CTRL+C
./ab b

Sometime between the CTRL+C and ./ab b the server will stop outputting broadcast.


Solution

  • [...] after the first client gets killed (e.g. using CTRL+C) while waiting for the broadcast, the server sometimes gets stuck in pthread_cond_broadcast [...]

    Why?

    Because killing the process may leave the CV and / or mutex in an inconsistent state. The same general thing can happen when one thread of a multithreaded process is forcibly killed, or when a multithreaded process forks. Indeed, given that the b processes spend most of their time waiting on the CV, it is pretty likely that they leave that inconsistent when they get terminated by a signal.

    And how should this be done correctly?

    To prevent the CV becoming inconsistent under such circumstances, you should ensure -- to the extent that it is possible -- that the b processes do not terminate while waiting on the CV. To protect them against that happening as a result of receiving a signal, set up a handler for the signal that raises a flag (of type sig_atomic_t). The process would then check that flag after returning from the wait to determine whether it needs to terminate. Conceivably, you could also broadcast to the CV to ensure that the process proceeds with the termination as soon as possible.
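
    A minimal sketch of that flag-based shutdown might look like the following. The names stop_requested, on_signal, install_stop_handlers and wait_for_events are made up for this illustration, and handling SIGINT and SIGTERM is an assumption about which signals matter to you. Note that pthread_cond_wait() is not required to return when a handled signal arrives, so the sketch relies on the server's periodic broadcast to wake the client before it notices the flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical flag: a volatile sig_atomic_t can safely be written
 * from a signal handler and read from the main flow of control. */
static volatile sig_atomic_t stop_requested = 0;

/* The handler only records the request; it never terminates the process. */
static void on_signal(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Install the handler for SIGINT and SIGTERM (an assumed choice of signals).
 * Returns 0 on success, -1 on failure. */
int install_stop_handlers(void)
{
    struct sigaction sa;

    sa.sa_handler = on_signal;
    sa.sa_flags = 0;
    if (sigemptyset(&sa.sa_mask) != 0)
        return -1;
    if (sigaction(SIGINT, &sa, NULL) != 0)
        return -1;
    if (sigaction(SIGTERM, &sa, NULL) != 0)
        return -1;
    return 0;
}

/* Client loop: wait on the shared CV until a stop is requested.
 * Returns the number of wakeups seen before the stop request,
 * or -1 if a pthread call failed. */
int wait_for_events(pthread_mutex_t *mutex, pthread_cond_t *cond)
{
    int received = 0;
    int rc = pthread_mutex_lock(mutex);

    if (rc != 0)
        return -1;
    while (!stop_requested) {
        rc = pthread_cond_wait(cond, mutex);
        if (rc != 0)
            break;              /* wait failed; stop using the CV */
        if (stop_requested)
            break;              /* woke up after a signal: leave cleanly */
        received++;
    }
    /* The process only exits from here, never from inside the wait. */
    pthread_mutex_unlock(mutex);
    return rc == 0 ? received : -1;
}
```

    In the example program, b() would call install_stop_handlers() once after mapping the shared memory and then call wait_for_events(&shm->mutex, &shm->cond) instead of looping forever. A signal that arrives just before the wait begins is only noticed after the next broadcast, which is acceptable here because the server broadcasts every second.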

    Do note, however, that some signals cannot be caught or blocked, and the above approach cannot do anything about those. Some other signals can be caught, but obligate the handler to terminate the program to avoid undefined behavior, and the above approach doesn't help with those, either.

    Additionally, there are other issues with your code, including

    • you do not check the return values of your function calls, apparently assuming that they always succeed.

    • you seem to have completely the wrong idea about the semantics of pthread_mutex_consistent():

      1. It is applicable only to robust mutexes, which yours are not configured to be.
      2. It is appropriate to call that function only after pthread_mutex_lock() indicates via its return value that the mutex is inconsistent, and after taking any action necessary to make program state guarded by the mutex consistent.
      3. Contrary to your claim in the comments, pthread_mutex_consistent() does not unlock the mutex. It just marks the mutex as having been returned to consistency. The mutex must still be unlocked before other threads can acquire it.
      4. Only the first thread / process to lock the mutex after it becomes inconsistent has the opportunity to make it consistent again. Thus, if you want to use a robust mutex in the example program, then both the a and b processes need to be prepared to handle an inconsistent mutex at every point where they acquire it.
      5. And since one place the b processes acquire the mutex is inside pthread_cond_wait(), and it does not have a documented mechanism to report on that event, robust mutexes probably are not a viable option for you.
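
    For reference, the usage pattern described in points 1, 2 and 4 looks roughly like the sketch below, which also checks every return value as the first bullet suggests. The function names init_robust_shared_mutex and lock_robust are hypothetical. The sketch covers explicit pthread_mutex_lock() calls only and does nothing about the state of the condition variable, so it is an illustration of the API rather than a fix for the original problem.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Initialize a process-shared, robust mutex.
 * Returns 0 on success or the error number of the first failing call. */
int init_robust_shared_mutex(pthread_mutex_t *mutex)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0)
        rc = pthread_mutex_init(mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Lock a robust mutex. If the previous owner died while holding it,
 * the lock is still acquired but reported as EOWNERDEAD; only then is
 * pthread_mutex_consistent() appropriate.
 * Returns 0 if the mutex is held and usable, otherwise an error number. */
int lock_robust(pthread_mutex_t *mutex)
{
    int rc = pthread_mutex_lock(mutex);

    if (rc == EOWNERDEAD) {
        /* Assumption: any shared state guarded by the mutex would be
         * repaired here, before the mutex is marked consistent. */
        rc = pthread_mutex_consistent(mutex);
        /* The mutex remains locked by the caller, who must unlock it. */
    }
    return rc;
}
```

    Every process that locks the mutex would call lock_robust() in place of pthread_mutex_lock() and still call pthread_mutex_unlock() afterwards, since marking the mutex consistent does not release it.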