C++17/Linux: signal not unblocking blocked network socket calls in separate thread

I have a multi-threaded application where the main thread spawns multiple (3+) threads, each tasked with performing something different. One of the threads is supposed to run a simple TCP server which would accept a single connection at a time and receive data from it.

The application catches and handles SIGTERM so it can coordinate proper cleanup between threads. Upon receiving this signal, it simply sets a global shared kill flag (type std::atomic<bool>) to true.

In the original design, the main thread performed the server duties. When it received SIGTERM, the accept() or recv() call it was in would fail with EINTR, and the application could detect this and know that it was time to join the other threads.

I am trying to re-work this to have one of the spawned threads act as the server. While the server functionality does indeed work properly, signal handling does not. When the application receives SIGTERM, the kill flag is set but the server thread continues to block on accept() or recv(), depending on whether it has a connected client already.

Upon investigating the issue, I've learned:

signals may be sent to any available thread in the process
each thread may have its own signal mask to block reception of certain signals

It's also become apparent to me from the issue I'm having that unless the particular thread that is blocked on an accept()/recv() catches the signal, the thread will continue to be blocked as the blocking function will not return EINTR.

Questions:

1. Why did the original design (where the main thread was the server) work every single time that I tested it? And I've tested it hundreds of times. I would imagine that at some point one of the other threads would have received the signal and the main thread would have continued to block. Why did that never happen?
2. What do you recommend to rectify this issue? I would like to continue having the server operate in a spawned thread and not in the main one. Here are some solutions I've read about:

a. Spawn a separate signal-catcher thread and block the signals in all of the other threads. I'm not sure how this would help my situation since if the server thread is blocked on a syscall and has masked out all signals, there would be no way to signal it to wake up and start cleanup. I've read about condition variables but, again, I don't see how that would unblock the blocked syscall.

b. Switch to using non-blocking sockets and write the server using select()/poll()/epoll(). I can see this working though it seems overkill for a server that will only process one client at a time. However, I am willing to do it if it is the best solution. But does this mean that all of the spawned threads are effectively prohibited from using blocking syscalls? Another of the threads which is not yet written is supposed to do some serial I/O. Does that also need to be written using these multiplexing functions?

Is there a way that 2a can work in my case or is 2b the only solution for me?

Restrictions: this project is using C++17 and my group is not allowed to use any libraries outside what's available on our (fairly standard) Linux systems and the C++ standard library. Boost and other 3rd party code are not options for us. We are also using pthreads directly and not through the C++ STL but I don't think that should affect this situation.

I have not yet tried implementing any solution as I am researching which one would be most optimal for my situation.

Solution

Background

Upon investigating the issue, I've learned:

signals may be sent to any available thread in the process

Yes. And also, signals sent to the process overall will be delivered to at most one of its threads. These are two separate concepts, albeit closely related.

each thread may have its own signal mask to block reception of certain signals

Yes, though it is more correct to say that each thread does have its own signal mask. Even if a thread's signal mask does not block any signals, it's still there.

It's also become apparent to me from the issue I'm having that unless the particular thread that is blocked on an accept()/recv() catches the signal, the thread will continue to be blocked as the blocking function will not return EINTR.

Yes. A function call failing with EINTR means that execution of that call was interrupted to allow the calling thread to run a signal handler, and that either the handler was installed without the SA_RESTART option or that the function in question is not among those supporting restarting when interrupted. A signal being handled by a signal handler has no direct effect on threads other than the one handling the signal.
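To make the SA_RESTART point concrete, here is a minimal sketch of installing a flag-setting handler with sigaction() and no SA_RESTART. The names g_kill_requested, request_kill, and install_kill_handler are hypothetical stand-ins for whatever your application uses, and the static_assert reflects my assumption that you want the flag to be safe to touch from a handler.

```cpp
#include <atomic>
#include <cassert>
#include <signal.h>

// Hypothetical global kill flag, standing in for the one in the question.
std::atomic<bool> g_kill_requested{false};

// A lock-free atomic is safe to store to from inside a signal handler.
static_assert(std::atomic<bool>::is_always_lock_free,
              "kill flag must be lock-free to be used in a signal handler");

// The handler only records that a shutdown was requested.
extern "C" void request_kill(int) { g_kill_requested.store(true); }

// Install the handler for `signo` WITHOUT SA_RESTART, so that a blocking
// accept()/recv() in the thread that runs the handler fails with EINTR
// instead of being restarted transparently.
bool install_kill_handler(int signo) {
    assert(signo > 0);
    struct sigaction sa{};
    sa.sa_handler = request_kill;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;  // deliberately no SA_RESTART
    return sigaction(signo, &sa, nullptr) == 0;
}
```

Only the thread that actually runs request_kill gets its syscall interrupted. Every other thread just sees the flag change the next time it looks.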

The original design

Why did the original design (where the main thread was the server) work every single time that I tested it? And I've tested it hundreds of times. I would imagine that at some point one of the other threads would have received the signal and the main thread would have continued to block. Why did that never happen?

POSIX does not specify how the thread that handles a given signal is chosen from among those that do not have it blocked. Nothing requires the choice to be non-deterministic, so the number of times you tested may have little relevance. With that said, there is likely at least one race in your original code such that if a SIGTERM arrived at exactly the wrong time, the application would not shut down cleanly, but that could easily be a one-in-a-million event.

Other than that, it is plausible that to choose a thread to handle an incoming signal, the system goes through the eligible threads in a fixed order, possibly related to the order in which they were created. It is also plausible that threads currently blocked on I/O operations are selected preferentially when they are available. These are examples of system implementation details that would explain why your old design operated well.

New design options

What do you recommend to rectify this issue? I would like to continue having the server operate in a spawned thread and not in the main one.

There are several distinct aspects to this, and you don't seem to be clearly delineating among them. The application must

1. Receive the SIGTERM or other termination signal.
2. Ensure that all threads are instructed to terminate, including
3. Make sure that all threads notice the instruction to shut down in a timely manner.

To customize program behavior upon receipt of a signal, you need to install a custom signal handler. This much you have done.

You are presently using an atomic variable as a flag to inform your threads that a shutdown has been requested. This is reasonable.

The main issue you are struggling with is (3), how to ensure that all threads notice the shutdown request in a timely manner. As you discovered, at most one thread's blocking syscall will be interrupted, so system calls that block indefinitely are an issue that needs to be addressed.

Here are some solutions I've read about:

Designated signal-catching thread

a. Spawn a separate signal-catcher thread and block the signals in all of the other threads.

It is useful to designate a specific thread to receive any inbound SIGTERM because that gives you a way to do most of the response in ordinary code instead of subject to the restrictions on signal handler behavior. That's more of an advantage than you may appreciate, but it's a solution to a different problem than you asked about.

Notifying other threads via signals

I'm not sure how this would help my situation since if the server thread is blocked on a syscall and has masked out all signals, there would be no way to signal it to wake up and start cleanup.

It does not need to be the case that threads other than the SIGTERM-handler have all signals blocked. There may be others that they should block, too, but there can be some that they allow. The SIGTERM handler could alert them by sending one of those signals to each. Some reasonable choices might be SIGABRT, SIGALRM, SIGUSR1, or SIGUSR2.

Whether a designated thread is responsible for it or not, notifying other threads via signals has the advantage that it can interrupt (some) blocking syscalls, but there's a problem with that. Suppose that an application does this ...

    something_nonblocking();
    block_indefinitely();
    if (I_should_terminate()) {
        clean_up();
        return;
    }

If that thread receives the notification signal while it is executing something_nonblocking(), then that signal will not interrupt block_indefinitely(). Even if the last thing in something_nonblocking() is to check whether I_should_terminate(), it is possible for the signal to arrive after the result of that check is determined (as false) but before block_indefinitely() is entered, rendering the signal ineffective for effecting prompt, clean shutdown.

You could reduce the likelihood of this issue manifesting by signaling twice, with a delay between, but you cannot altogether eliminate it.
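For illustration, here is a rough sketch of thread-directed notification with pthread_kill(). It is a variation on the retry idea rather than something I am prescribing: instead of signaling exactly twice, the notifier keeps signaling until the worker acknowledges. SIGUSR1, the Worker struct, and the function names are my own assumptions, and the worker blocks in read() on an empty pipe as a stand-in for accept()/recv().

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

extern "C" void wake_up(int) {}  // empty handler: it exists only to interrupt

struct Worker {
    int read_fd = -1;
    std::atomic<bool> done{false};
    int saved_errno = 0;
};

// Stand-in for a server thread stuck in accept()/recv().
void* blocked_worker(void* arg) {
    auto* w = static_cast<Worker*>(arg);
    char byte;
    ssize_t n = read(w->read_fd, &byte, 1);  // nothing is ever written
    w->saved_errno = (n == -1) ? errno : 0;
    w->done.store(true);
    return nullptr;
}

// Returns the errno the worker saw when its read() was interrupted,
// or -1 if the setup failed.
int interrupt_blocked_worker() {
    struct sigaction sa{};
    sa.sa_handler = wake_up;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;  // no SA_RESTART, so read() fails with EINTR
    if (sigaction(SIGUSR1, &sa, nullptr) != 0) return -1;

    int fds[2];
    if (pipe(fds) != 0) return -1;

    Worker w;
    w.read_fd = fds[0];
    pthread_t tid;
    if (pthread_create(&tid, nullptr, blocked_worker, &w) != 0) return -1;

    // A signal that lands before the worker enters read() is wasted,
    // so keep signaling until the worker reports that it got out.
    while (!w.done.load()) {
        pthread_kill(tid, SIGUSR1);
        usleep(1000);
    }
    pthread_join(tid, nullptr);
    assert(w.done.load());
    close(fds[0]);
    close(fds[1]);
    return w.saved_errno;
}
```

Note that this needs an acknowledgement from every thread and a polling loop in the notifier, which is part of why I find the I/O-based approach cleaner.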

Notifying other threads via I/O

Another option for alerting other threads to terminate would be ...

b. Switch to using non-blocking sockets and write the server using select()/poll()/epoll(). I can see this working though it seems overkill for a server that will only process one client at a time.

I don't see why you think that's overkill. The point is to avoid being blocked on I/O preventing you from doing work that's ready to perform right away (i.e. shutting down). That's exactly the purpose these functions serve.

I suppose you may be imagining using these merely to put a timeout on blocking I/O, which could work, but if you exercise this general approach then you should consider setting up an I/O channel -- a pipe, for example -- by which the termination notice could be conveyed directly to the I/O thread(s). If that thread includes the read end of such a channel among the FDs it monitors then not only would there be genuine multiplexing, but the thread would be more responsive to shutdown notifications because it would not have to wait for a timeout to expire.
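A minimal sketch of that idea with poll() follows. The Wake enum and wait_for_data_or_shutdown() are hypothetical names; data_fd would be your listening or client socket, and shutdown_fd the read end of the notification pipe.

```cpp
#include <cassert>
#include <cerrno>
#include <poll.h>
#include <unistd.h>

enum class Wake { Data, Shutdown, Error };

// Block until either the data descriptor or the shutdown channel is ready.
// No timeout is needed: a byte written to the pipe wakes the thread at once.
Wake wait_for_data_or_shutdown(int data_fd, int shutdown_fd) {
    assert(data_fd >= 0 && shutdown_fd >= 0);
    pollfd fds[2] = {};
    fds[0].fd = shutdown_fd;
    fds[0].events = POLLIN;
    fds[1].fd = data_fd;
    fds[1].events = POLLIN;

    for (;;) {
        int rc = poll(fds, 2, -1);
        if (rc < 0) {
            if (errno == EINTR) continue;  // a signal arrived; just retry
            return Wake::Error;
        }
        // Check the shutdown channel first so that it takes priority.
        if (fds[0].revents != 0) return Wake::Shutdown;
        if (fds[1].revents != 0) return Wake::Data;
    }
}
```

The server thread would call this before each accept() or recv() and perform the blocking call only when it returns Wake::Data, at which point the call should not block for long.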

However, I am willing to do it if it is the best solution. But does this mean that all of the spawned threads are effectively prohibited from using blocking syscalls?

Sort of. Blocking syscalls would be ok to the extent that you could be confident (or at least be willing to assume) that they would not block very long. For example, I/O on a regular file on a local filesystem is ordinarily conducted via blocking operations, but only under extraordinary circumstances would such an operation block long enough to interfere with your shutdown.

Another of the threads which is not yet written is supposed to do some serial I/O. Does that also need to be written using these multiplexing functions?

Maybe. If those serial I/O operations may otherwise block longer than you're willing to wait for thread shutdown to commence, then meeting your termination-behavior goals requires bringing that under control. Gating them with one of the multiplexing functions is one way to do that. Relying on interrupting them with a signal or two might be another. I would recommend choosing one approach and using it consistently, but that's a design choice.

Overall recommendation

Do use a dedicated signal-handling thread. In particular, I would suggest that
- a do-nothing signal handler be registered for SIGTERM and any other signal you want to be handled by this thread
- those signals be blocked for all other threads via their signal masks
- all signals but those be blocked for the handler thread
- the handler waits for a signal synchronously via pause() or sigsuspend()
Do use select(), poll(), or epoll to gate execution of I/O operations that otherwise might block longer than you are willing to wait for shutdown to commence.
Do set up a pipe or similar I/O channel that you can add to the multiplexer's monitored set, and let the signal-handling thread use that to ensure that threads notice the termination signal in a timely manner when they may need help to do so.
- when the signal-handling thread detects a signal, it first sets the global atomic flag to record that, then writes to this channel
- other threads need only see that the channel is readable. They don't need to actually consume any data from it, and they shouldn't, so that it remains readable.
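
Putting the pieces together, a sketch of the whole arrangement might look like the following. All names (g_shutdown, g_wake_pipe, run_shutdown_demo, and so on) are hypothetical. It makes one deliberate choice that differs slightly from the bullets: SIGTERM stays blocked in the signal-handling thread too, and is unblocked only for the duration of sigsuspend(), so a signal that arrives before the thread starts waiting stays pending instead of being lost. The demo sends SIGTERM to its own process purely so that it terminates by itself.

```cpp
#include <atomic>
#include <cassert>
#include <poll.h>
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

std::atomic<bool> g_shutdown{false};
int g_wake_pipe[2] = {-1, -1};

extern "C" void ignore_signal(int) {}  // do-nothing handler

void* signal_thread(void*) {
    sigset_t wait_mask;
    sigfillset(&wait_mask);
    sigdelset(&wait_mask, SIGTERM);  // only SIGTERM can end the wait
    sigsuspend(&wait_mask);          // returns once the handler has run
    g_shutdown.store(true);          // record the request first ...
    const char byte = 'x';
    ssize_t ignored = write(g_wake_pipe[1], &byte, 1);  // ... then wake pollers
    (void)ignored;
    return nullptr;
}

void* worker_thread(void* noticed) {
    pollfd pfd{};
    pfd.fd = g_wake_pipe[0];  // a real server would also list its sockets here
    pfd.events = POLLIN;
    while (!g_shutdown.load()) {
        poll(&pfd, 1, -1);    // wakes when the pipe becomes readable
    }
    // The byte is left in the pipe so that it stays readable for others.
    *static_cast<bool*>(noticed) = true;
    return nullptr;
}

bool run_shutdown_demo() {
    struct sigaction sa{};
    sa.sa_handler = ignore_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGTERM, &sa, nullptr) != 0) return false;

    // Block SIGTERM before creating threads; they inherit this mask.
    sigset_t block, old;
    sigemptyset(&block);
    sigaddset(&block, SIGTERM);
    pthread_sigmask(SIG_BLOCK, &block, &old);

    if (pipe(g_wake_pipe) != 0) return false;

    bool noticed = false;
    pthread_t sig, worker;
    int rc1 = pthread_create(&sig, nullptr, signal_thread, nullptr);
    int rc2 = pthread_create(&worker, nullptr, worker_thread, &noticed);
    assert(rc1 == 0 && rc2 == 0);
    (void)rc1;
    (void)rc2;

    kill(getpid(), SIGTERM);  // demo only: stands in for an external SIGTERM

    pthread_join(worker, nullptr);
    pthread_join(sig, nullptr);
    pthread_sigmask(SIG_SETMASK, &old, nullptr);
    close(g_wake_pipe[0]);
    close(g_wake_pipe[1]);
    return noticed && g_shutdown.load();
}
```

In a real application the body of run_shutdown_demo() would live in main(), minus the kill() call, and each I/O thread would add g_wake_pipe[0] to the descriptor set it already polls.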