c++ linux recursion stack redhat

Is there a way to catch stack overflow in a process? C++ Linux


I have the following code, which goes into infinite recursion and triggers a segmentation fault when it exhausts the stack limit allocated to it. I am trying to catch this segmentation fault and exit gracefully. However, I have not been able to catch it with any signal number.

(A customer is facing this issue and wants a solution for this use case. Increasing the stack size with something like "limit stacksize 128M" makes his test pass. However, he is asking for a graceful exit rather than a seg fault. The following code simply reproduces the issue; it is not what the actual algorithm does.)

Any help is appreciated. If something is incorrect in the way I am trying to catch the signal please let me know that too. To compile: g++ test.cc -std=c++0x

#include <iostream>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <string.h>

int recurse_and_crash (int val)
{
    // Print rough call stack depth at intervals.
    if ((val %1000) == 0)
    {
        std::cout << "\nval: " << val;
    }
    return val + recurse_and_crash (val+1);
}

void signal_handler(int signal, siginfo_t * si, void * arg)
{
    std::cout << "Caught segfault\n";
    exit(0);
}


int main(int argc, char ** argv)
{
    int signal = 11; // SIGSEGV
    if (argc == 2)
    {
        signal = std::stoi(std::string(argv[1]));
    }

    struct sigaction sa;
    memset(&sa, 0, sizeof(struct sigaction));
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = signal_handler;
    sa.sa_flags   = SA_SIGINFO;

    sigaction(signal, &sa, NULL);
    recurse_and_crash (1);  
}

Solution

  • This is a surprisingly complex problem to solve. I will not give working code at this point, but rather focus on a few "nifty" issues that you have already run into, or will run into as you continue coding for this.

    First, why do you keep faulting instead of reaching your handler?

    The reason is that while signal handlers are "execution context transfers", by default they do not have their own stack. If you receive a signal as a consequence of an overflowed stack, the handler's frame and the context passed to it have to be placed on that same, already exhausted stack, and that attempt simply raises the same signal again.

    To make sure signal handlers run on their own separate / preallocated stack, use sigaltstack() and the SA_ONSTACK flag for sigaction().
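
    As a rough illustration of that step, here is a minimal sketch. The helper name install_on_alt_stack, the handler on_segv, and the 64 KiB stack size are assumptions made for this example, not part of the question's code; only sigaltstack(), sigaction() and SA_ONSTACK come from the text above.

```cpp
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Hypothetical helper: registers an alternate signal stack for the calling
// thread and installs `handler` for `signo` so that it runs on that stack.
// Returns true if both system calls succeed.
static bool install_on_alt_stack(int signo,
                                 void (*handler)(int, siginfo_t*, void*))
{
    // Static storage keeps the memory valid for the life of the process.
    // 64 KiB is an assumed size, chosen to be well above MINSIGSTKSZ.
    // Note: sigaltstack() is per thread; other threads need their own stack.
    static char alt_stack[64 * 1024];

    stack_t ss;
    memset(&ss, 0, sizeof ss);
    ss.ss_sp = alt_stack;
    ss.ss_size = sizeof alt_stack;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, nullptr) != 0)
        return false;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;  // SA_ONSTACK selects the alt stack
    return sigaction(signo, &sa, nullptr) == 0;
}

// Example handler that restricts itself to async-signal-safe calls:
// write() and _exit() instead of std::cout and exit().
static void on_segv(int, siginfo_t*, void*)
{
    static const char msg[] = "Caught SIGSEGV on the alternate stack\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)ignored;
    _exit(EXIT_FAILURE);
}
```

    Calling install_on_alt_stack(SIGSEGV, on_segv) at the top of main(), before recurse_and_crash(1), should give the handler room to run. The handler here avoids std::cout and exit() because neither is async-signal-safe.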

    Second, depending on how badly the stack overruns (your test program may not trigger this, but a real-world program may), the memory access that actually constitutes the overflow may end up raising signals other than SIGSEGV.
    Your example can install the handler for any signal number, but catching signals indiscriminately is in practice insufficient and confusing: you sending your app a SIGUSR1, or the terminal sending it a SIGTTOU after it has been backgrounded, is absolutely not indicative of a stack overflow.
    This raises another issue: which signals are to be expected when making an "out of stack" memory access as a consequence of a stack overflow? And how can you know that a specific signal you got was due to a stack access?
    The answer to that is, again, more complex than it looks at first sight:

    • if the stack overflow is "small enough", it's conceivable that the access lands within a guard page (a valid mapping, but deliberately inaccessible), and it'll trigger SIGSEGV.
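    • if no guard pages are used and the access goes to an unmapped memory region, you may receive a SIGBUS instead on some systems. (On Linux, an access to an unmapped address is normally still reported as SIGSEGV, with si_code SEGV_MAPERR rather than SEGV_ACCERR.)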
    • Even the CPU instruction used can make a difference as to whether an access to "invalid memory address X" results in SIGSEGV or SIGBUS. (For example, on x86, certain instructions raise #GP while others raise #PF for the same memory address read/write, and the Linux kernel possibly translates one to SIGBUS and the other to SIGSEGV.)
    • if other memory happens to be mapped where this access lands (say, you've got char local_to_blow_stack[1ULL << 40]; memset(&local_to_blow_stack, 0, 1); and something else valid just happens to sit at "whatever your stack is minus a terabyte"), that access will in fact just work. Unless the compiler generates "assist" code to identify such accesses, it's actually possible that you've blown the stack and still make a number of successful, non-signaling memory accesses before eventually hitting a memory region that triggers a signal.
    • You may receive these signals for invalid operations other than stack accesses. Heap accesses and memory-mapped file/device accesses could result in the same signals as well.

    So "just catching signals", even "catching all signals that may possibly occur as a consequence of a stack overflow", is insufficient. Within the signal handler, you need to decode the memory access location, and possibly the operation / CPU instruction, to verify that the attempted memory access actually was a "stack access out of bounds". It is possible for a thread to retrieve its own stack boundaries: pthread_getattr_np() (https://man7.org/linux/man-pages/man3/pthread_getattr_np.3.html) can be used for this, at least on Linux. (_np means "non-portable": this isn't guaranteed to be available on all systems, and others may have different interfaces to retrieve this information.) Finding the memory location that was accessed, however, again depends on the signal and the accessing instruction. Often (but not always) it's in the si_addr field of the siginfo_t.
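
    To make the boundary lookup concrete, here is a minimal sketch for Linux/glibc. StackBounds, current_stack_bounds, near_stack and the 64 KiB slack are names and values invented for this example. It also assumes the bounds are queried once, ahead of time, and cached for the handler to read, since pthread_getattr_np() is not documented as async-signal-safe.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // pthread_getattr_np() is a GNU extension
#endif
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical record of the calling thread's stack region.
struct StackBounds {
    uintptr_t low;   // lowest address reported for the stack
    uintptr_t high;  // one past the highest address
    size_t guard;    // guard size reported by the thread attributes
};

// Fills `out` with the calling thread's stack bounds; returns false on error.
static bool current_stack_bounds(StackBounds* out)
{
    pthread_attr_t attr;
    if (pthread_getattr_np(pthread_self(), &attr) != 0)
        return false;

    void* addr = nullptr;
    size_t size = 0;
    size_t guard = 0;
    bool ok = pthread_attr_getstack(&attr, &addr, &size) == 0 &&
              pthread_attr_getguardsize(&attr, &guard) == 0;
    pthread_attr_destroy(&attr);
    if (!ok)
        return false;

    out->low = reinterpret_cast<uintptr_t>(addr);
    out->high = out->low + size;
    out->guard = guard;
    return true;
}

// Rough heuristic, not a proof: treats a fault address as stack-related if it
// lies inside the reported stack region or within guard + slack bytes below it.
static bool near_stack(const StackBounds& b, const void* fault_addr,
                       size_t slack = 64 * 1024)
{
    uintptr_t a = reinterpret_cast<uintptr_t>(fault_addr);
    uintptr_t margin = b.guard + slack;
    uintptr_t lowest = b.low > margin ? b.low - margin : 0;
    return a >= lowest && a < b.high;
}
```

    A handler could then call near_stack(cached_bounds, si->si_addr), where cached_bounds is a hypothetical global filled in before the recursion starts. A large overrun that skips past the slack window would not be recognized by this check.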

    From what I remember, exactly which signals fill si_addr under exactly what circumstances, and whether the address in there is the instruction issuing the memory access or the memory location of the attempted access, is somewhat system- and hardware-dependent (Linux may behave differently from Windows or macOS, and ARM differently from x86).
    So you would also need to validate that the si_addr in this siginfo_t is somewhere near the signaled thread's stack, and possibly also that the instruction that caused it was actually a memory access, i.e. that si_addr can be "traced back" to the instruction that faulted. Finding the faulting instruction's address (the program counter) requires decoding the other argument of the signal handler, the ucontext_t, and there you're deep deep deep [ recurse infinity here ] in HW / OS specifics.
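
    To give a feel for how platform-specific this gets, here is a sketch that reads the program counter from the ucontext_t. It covers only Linux on x86-64 and AArch64 with glibc-style headers and returns 0 everywhere else. faulting_pc, record_pc and g_last_pc are names made up for this example.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes REG_RIP in the glibc headers
#endif
#include <signal.h>
#include <stdint.h>
#include <ucontext.h>

// Hypothetical helper: extracts the interrupted program counter from the
// third argument of an SA_SIGINFO handler. Returns 0 on platforms this
// sketch does not know about.
static uintptr_t faulting_pc(const void* context)
{
    const ucontext_t* uc = static_cast<const ucontext_t*>(context);
#if defined(__linux__) && defined(__x86_64__)
    return static_cast<uintptr_t>(uc->uc_mcontext.gregs[REG_RIP]);
#elif defined(__linux__) && defined(__aarch64__)
    return static_cast<uintptr_t>(uc->uc_mcontext.pc);
#else
    (void)uc;
    return 0;  // unknown platform: the register layout differs
#endif
}

// Last program counter seen by record_pc(); written from the handler.
static volatile uintptr_t g_last_pc = 0;

// Example SA_SIGINFO handler that only records where execution was interrupted.
static void record_pc(int, siginfo_t*, void* context)
{
    g_last_pc = faulting_pc(context);
}
```

    Deciding whether the instruction at that address really was a memory access would additionally require decoding the instruction itself, which this sketch does not attempt.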

    At this point I'd like to terminate. A "simple" but imperfect solution just needs an alternate signal stack and a handler that retrieves the current stack boundaries via pthread_getattr_np() and compares si_addr against them. If your life or that of others depends on the correct answer, remember the above, though.