Returning from a signal handler via setcontext

I'm trying to use the third argument of a SA_SIGINFO sigaction to jump to the interrupted context directly.

I thought this:

void action(int Sig, siginfo_t *Info, void *Uctx) {
    ucontext_t *uc = Uctx;
    setcontext(uc);
}

would have the same effect as just:

void action(int Sig, siginfo_t *Info, void *Uctx) { 
    return; 
}

but, curiously, the program handles three signals (each of which invokes the setcontext-calling handler) and then segfaults inside setcontext:

Dump of assembler code for function setcontext:
   0x00007ffff7a34180 <+0>:     push   %rdi
   0x00007ffff7a34181 <+1>:     lea    0x128(%rdi),%rsi
   0x00007ffff7a34188 <+8>:     xor    %edx,%edx
   0x00007ffff7a3418a <+10>:    mov    $0x2,%edi
   0x00007ffff7a3418f <+15>:    mov    $0x8,%r10d
   0x00007ffff7a34195 <+21>:    mov    $0xe,%eax
   0x00007ffff7a3419a <+26>:    syscall
   0x00007ffff7a3419c <+28>:    pop    %rdi
   0x00007ffff7a3419d <+29>:    cmp    $0xfffffffffffff001,%rax
   0x00007ffff7a341a3 <+35>:    jae    0x7ffff7a34200 <setcontext+128>
   0x00007ffff7a341a5 <+37>:    mov    0xe0(%rdi),%rcx
   0x00007ffff7a341ac <+44>:    fldenv (%rcx)
=> 0x00007ffff7a341ae <+46>:    ldmxcsr 0x1c0(%rdi)
   0x00007ffff7a341b5 <+53>:    mov    0xa0(%rdi),%rsp
   0x00007ffff7a341bc <+60>:    mov    0x80(%rdi),%rbx

and the fault address shown by strace is 0 (a catchable SIGSEGV).

Here's an example program that's using a timer to send the three signals:

#include <unistd.h>
#include <sys/time.h>
#include <ucontext.h>
#include <signal.h>
  
void action(int Sig, siginfo_t *Info, void *Uctx) {
    ucontext_t *uc = Uctx;
    setcontext(uc);
}
int main(void) {
    char ch[100];
    sigaction(SIGALRM, &(struct sigaction){.sa_sigaction = action, .sa_flags = SA_SIGINFO}, 0);
    setitimer(ITIMER_REAL, &(struct itimerval){.it_interval.tv_sec = 1,.it_value.tv_sec = 1}, 0);
    write(1, "enter\n", 6);
    for (;;) {
        write(1, "{\n", 2);
        read(0, &ch[0], sizeof(ch));
        write(1, "}\n", 2);
    }
}

What is going on in this situation?

Solution

I think this is something that's simply not meant to work: you are only supposed to call setcontext with a context obtained from getcontext or makecontext, and not with the context passed to a signal handler.

The man page hints at this obliquely:

If the context was obtained by a call to a signal handler, then old standard text says that "program execution continues with the program instruction following the instruction interrupted by the signal". However, this sentence was removed in SUSv2, and the present verdict is "the result is unspecified".

Also, the glibc source of setcontext has a comment:

This implementation is intended to be used for synchronous context switches only. Therefore, it does not have to restore anything other than the PRESERVED state.

Indeed, apart from reloading the x87 environment and mxcsr (the fldenv and ldmxcsr in the disassembly above), it does not attempt to restore the contents of any of the floating-point registers, and it zeroes rax (as for getcontext returning 0). That would be pretty bad for trying to resume code that isn't expecting its registers to change spontaneously.
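For contrast, here is a minimal sketch of the supported, synchronous use, where the context comes from getcontext and is re-entered from ordinary code rather than from a signal handler. The helper name replay_passes and its limit parameter are made up for illustration; only getcontext and setcontext come from the text above.

```c
#include <ucontext.h>

/* Hypothetical helper: captures a context with getcontext() and jumps
 * back to it with setcontext() until `limit` passes have run.
 * Returns the number of passes executed, or -1 if getcontext() fails. */
int replay_passes(int limit) {
    ucontext_t ctx;
    /* volatile so the counter is kept in memory and is not rolled back
     * when setcontext() reloads the saved registers */
    volatile int passes = 0;

    if (getcontext(&ctx) != 0)
        return -1;

    /* Execution resumes here after every successful setcontext(&ctx). */
    passes++;
    if (passes < limit)
        setcontext(&ctx);   /* does not return on success */

    return passes;
}
```

Calling replay_passes(3) from main should run the code after getcontext three times and return 3. The jump is synchronous: the code after getcontext already expects the call-clobbered registers to be lost across a call, so it does not matter that setcontext restores only part of the register state.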

Asynchronous context switching would be needed for something like preemptive multitasking in userspace. I think the idea is that since pthreads is now firmly established, people should have no need for this, so it's not supported. getcontext/setcontext date from an earlier era, and in fact were removed from the POSIX spec (in POSIX.1-2008) on the premise that pthreads should be used instead.

This particular crash seems to be caused by a mismatch between the kernel's layout of struct ucontext_t and what libc expects. In particular, libc expects the floating-point state, including the saved value of mxcsr, at a fixed offset within struct ucontext_t (the ldmxcsr 0x1c0(%rdi) in the disassembly above). However, the kernel pushes the floating-point state at a separate location on the stack (which happens to overlap where libc expects it) and stores a pointer to it inside struct ucontext_t. So libc's setcontext attempts to load some garbage value into mxcsr, one that has some of the reserved bits 16-31 set, and this causes a general protection fault. A general protection fault is delivered as a SIGSEGV without a meaningful fault address, which is consistent with the 0 reported by strace.
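The mismatch can be observed directly. The following is a rough sketch that assumes x86-64 Linux with glibc, since it relies on glibc's internal __fpregs_mem member and on struct _libc_fpstate, and will not compile elsewhere. The function names probe, libc_mxcsr_offset and kernel_fpregs_elsewhere are invented for this illustration, and SIGUSR1 is an arbitrary choice of signal.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <ucontext.h>

/* Written by the handler: 1 if the kernel's fpregs pointer does not point
 * at the __fpregs_mem area embedded in ucontext_t, 0 if it does.
 * Stays -1 until the handler has run. */
static volatile sig_atomic_t fpregs_elsewhere = -1;

static void probe(int sig, siginfo_t *info, void *uctx) {
    ucontext_t *uc = uctx;
    (void)sig;
    (void)info;
    fpregs_elsewhere =
        (void *)uc->uc_mcontext.fpregs != (void *)&uc->__fpregs_mem;
}

/* Offset within glibc's ucontext_t at which glibc keeps the saved mxcsr. */
size_t libc_mxcsr_offset(void) {
    return offsetof(ucontext_t, __fpregs_mem)
         + offsetof(struct _libc_fpstate, mxcsr);
}

/* Installs the probe handler, raises one signal, and returns what the
 * handler observed (-1 if the handler could not be installed). */
int kernel_fpregs_elsewhere(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = probe;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    raise(SIGUSR1);   /* handler runs before raise() returns */
    return fpregs_elsewhere;
}
```

Under those assumptions, libc_mxcsr_offset() should come out as 0x1c0, the displacement used by the faulting ldmxcsr, while kernel_fpregs_elsewhere() should return 1: the pointer the kernel stored in the signal frame does not refer to the embedded area that setcontext reads mxcsr from.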

However, as noted above, this mismatch is the least of the problems.