We have a legacy embedded system which uses SDL to read images and fonts from an NFS share.
If there's a network problem, TTF_OpenFont() and IMG_Load() hang essentially forever. A test application reveals that open() behaves in the same way.
It occurred to us that a quick fix would be to call alarm() before the calls which open files on the NFS share. The man pages weren't entirely clear whether open() would fail with EINTR when interrupted by SIGALRM, so we put together a test app to verify this approach. We set up a signal handler with sigaction::sa_flags set to zero to ensure that SA_RESTART was not set.
The signal handler was called, but open() was not interrupted. (We observed the same behaviour with SIGINT and SIGTERM.)
I suppose the system treats open() as a "fast" operation even on "slow" infrastructure such as NFS.
Is there any way to change this behaviour and allow open() to be interrupted by a signal?
The man pages weren't entirely clear whether open() would fail with EINTR when interrupted by SIGALRM, so we put together a test app to verify this approach.
open(2)
is a slow syscall (slow syscalls are those that can sleep forever, and can be awaken when, and if, a signal is caught in the meantime) only for some file types. In general, opens that block the caller until some condition occurs are usually interruptible. Known examples include opening a FIFO (named pipe), or (back in the old days) opening a physical terminal device (it sleeps until the modem is dialed).
NFS-mounted filesystems probably don't cause open(2)
to sleep in an interruptible state. After all, you are most likely opening a regular file, and in that case open(2)
will not be interruptable.
Is there any way to change this behaviour and allow open() to be interrupted by a signal?
I don't think so, not without doing some (non-trivial) changes to the kernel.
I would explore the possibility of using setjmp(3)
/ longjmp(3)
(see the manpage if you're not familiar; it's basically non-local gotos). You can initialize the environment buffer before calling open(2)
, and issue a longjmp(3)
in the signal handler. Here's an example:
#include <stdio.h>
#include <stdlib.h>
#include <setjmp.h>
#include <unistd.h>
#include <signal.h>
static jmp_buf jmp_env;
void sighandler(int signo) {
longjmp(jmp_env, 1);
}
int main(void) {
struct sigaction sigact;
sigact.sa_handler = sighandler;
sigact.sa_flags = 0;
sigemptyset(&sigact.sa_mask);
if (sigaction(SIGALRM, &sigact, NULL) < 0) {
perror("sigaction(2) error");
exit(EXIT_FAILURE);
}
if (setjmp(jmp_env) == 0) {
/* First time through
* This is where we would open the file
*/
alarm(5);
/* Simulate a blocked open() */
while (1)
; /* Intentionally left blank */
/* If open(2) is successful here, don't forget to unset
* the alarm
*/
alarm(0);
} else {
/* SIGALRM caught, open(2) canceled */
printf("open(2) timed out\n");
}
return 0;
}
It works by saving the context environment with the help of setjmp(3)
before calling open(2)
. setjmp(3)
returns 0 the first time through, and returns whatever value was passed to longjmp(3)
otherwise.
Please be aware that this solution is not perfect. Here are some points to keep in mind:
alarm(2)
and the call to open(2)
(simulated here with while (1) { ... }
) where the process may be preempted for a long time, so there is a chance the alarm expires before we actually attempt to open the file. Sure, with a large timeout such as 2 or 3 seconds this will most likely not happen, but it's still a race condition.SIGALRM
is caught. Using longjmp(3)
inside the signal handler will destroy the execution context of these other signal handlers, and depending on what they were doing, very nasty things may happen (inconsistent state if the signal handlers were manipulating other data structures in the program, etc.). It's as if it started executing, and suddenly crashed somewhere in the middle. You can fix it by: a) carefully setting up all signal handlers such that SIGALRM
is blocked before they are invoked (this ensures that the SIGALRM
handler does not begin execution until other handlers are done) and b) blocking these other signals before catching SIGALRM
. Both actions can be accomplished by setting the sa_mask
field of struct sigaction
with the necessary mask (the operating system atomically sets the process's signal mask to that value before beginning execution of the handler and unsets it before returning from the handler). OTOH, if the rest of the code doesn't catch signals, then this is not a problem.sleep(3)
may be implemented with alarm(2)
, and alarm(2)
and setitimer(2)
share the same timer; if other portions in the code make use of any of these functions, they will interfere and the result will be a huge mess.Just make sure you weigh in these disadvantages before blindly using this approach. The use of setjmp(3)
/ longjmp(3)
is usually discouraged and makes programs considerably harder to read, understand and maintain. It's not elegant, but right now I don't think you have a choice, unless you're willing to do some core refactoring in the project.
If you do end up using setjmp(3)
, then at the very least document these limitations.