c++ linux gdb valgrind address-sanitizer

Corrupted stack root cause detection


I have a problem with a corrupted stack in a multithreaded application.

There is a class:

class A {
public:
    // some public methods
private:
    // some references to other objects, like:
    ClassA& ref;
    ClassB& ref2;
    // ...
    // some fields, like:
    std::map<std::string, enumClass> ...
    std::mutex ...
    std::map<std::string, someClass> ...
    std::mutex ...  // again some mutex
    std::map<std::string, std::pair<ClassB, someEnum>> corrupted_map;
    bool isTrue;
};

More specifically, the issue shows up as a segmentation fault caused by operator[] on corrupted_map. A debugging session showed that one of the fields of the STL tree had changed without any operation on corrupted_map, which is why I think this is stack memory corruption: the right-leaf pointer in the header of the STL red-black tree points to inaccessible memory. Further investigation showed that an operation on another map corrupts corrupted_map. An additional problem is that reproducing the issue takes about 30 minutes and requires a lot of traffic (it is one of the box tests).

Analysing the core dump is pointless, because the corruption happens about 1-2 minutes before the core is dumped.

The question for you experts is: how to detect origin of that stack memory corruption? Are there other tools I could try?

I tried with:

ASAN address sanitizer - nothing detected until segfault

GDB - too slow; the application is killed before the issue reproduces (a lot of watchdogs, time dependencies, etc.)

valgrind - also too slow; running it on the unit tests detected nothing

static code analyzers - nothing detected

TSAN (thread sanitizer) - I fixed some of the issues it detected, but that did not help

I found the place that corrupts the map by adding a thread that scans the STL tree fields every 2 ms, plus additional checks in suspicious methods. However, the map operation that causes the issue is probably itself operating on corrupted data.


Solution

  • how to detect origin of that stack memory corruption?

    Almost certainly this is not stack corruption but heap corruption: none of the elements of the map are on the stack.
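To make the point about map elements concrete, here is a minimal sketch that counts the allocations a std::map performs. The names CountingAllocator, g_allocations and count_node_allocations are hypothetical and exist only for this illustration. The map object in the sketch is a local variable, yet every inserted element causes a call to the allocator, which (via std::allocator) obtains the node memory from operator new, that is, from the heap.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <utility>

// Hypothetical global counter, for illustration only (not thread-safe).
static std::size_t g_allocations = 0;

// Minimal allocator that counts allocate() calls and forwards to std::allocator.
template <typename T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        ++g_allocations;
        return std::allocator<T>().allocate(n);
    }
    void deallocate(T* p, std::size_t n) noexcept {
        std::allocator<T>().deallocate(p, n);
    }
};

template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept {
    return true;
}
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept {
    return false;
}

// Inserts `elements` keys into a map that lives on this function's stack
// and returns how many allocations the map requested while doing so.
std::size_t count_node_allocations(int elements) {
    using Map = std::map<int, int, std::less<int>,
                         CountingAllocator<std::pair<const int, int>>>;
    g_allocations = 0;
    Map m;  // the map object itself is a local (stack) variable
    for (int i = 0; i < elements; ++i) {
        m[i] = i;  // each new key needs a tree node from the allocator
    }
    return g_allocations;
}
```

Calling count_node_allocations(3) should report at least three allocations, one per node. Only the small bookkeeping part of the map is stored inside the map object itself, wherever that object happens to live.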

    ASAN address sanitizer - nothing detected until segfault

    That is surprising -- ASan is usually very good at detecting heap corruption.

    There are a few ways I'd approach this:

    1. run the ASan test 10 or more times.
    2. adjust ASan runtime flags, in particular quarantine_size_mb.

    Why (1)? Sometimes ASan detects a problem and starts reporting it, but before it can finish, another thread hits SIGSEGV and the process dies without any report being printed. Repeating the test 100 times may get you a report in one of the runs, and a single report should be enough!

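    Why (2)? As the flag description says, use-after-free may not be detected if you are doing a lot of allocations. ASan keeps freed blocks in a quarantine so that later accesses to them can be reported; once a block leaves the quarantine it can be reused, and a stale access to it then looks valid. A larger quarantine keeps freed blocks detectable for longer.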

    You could also enable detect_stack_use_after_return, and it may detect existing errors, though I doubt you really have a stack problem here.
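As a sketch of how the two flags mentioned above could be set from within the program itself: ASan looks for a function named __asan_default_options at startup and uses the string it returns as default runtime flags, and the ASAN_OPTIONS environment variable can still override them. The value 1024 below is an arbitrary assumption, not a recommendation; pick what fits the memory budget of the test machine.

```cpp
// Default ASan runtime flags embedded in the binary.
// Only takes effect when the program is built with -fsanitize=address;
// otherwise this is just an ordinary, unused function.
extern "C" const char* __asan_default_options() {
    // quarantine_size_mb: assumed value, larger means freed blocks are
    //   kept unavailable for reuse (and therefore detectable) for longer.
    // detect_stack_use_after_return: enables the stack check mentioned above.
    return "quarantine_size_mb=1024:detect_stack_use_after_return=1";
}
```

The same flags can be passed without recompiling by exporting ASAN_OPTIONS with the same colon-separated string before starting the test.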

    P.S. Henri Menke's suggestion to use -D_GLIBCXX_ASSERTIONS and -D_GLIBCXX_DEBUG is also a very good one; see the libstdc++ documentation.
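To illustrate the kind of bug these libstdc++ macros are meant to catch, here is a small sketch. The function erase_then_read is hypothetical and unrelated to the code in the question. With buggy set to true it dereferences a map iterator after the element has been erased. In a normal build that is undefined behaviour that may go unnoticed; when the file is compiled with -D_GLIBCXX_DEBUG, libstdc++ should abort at the stale dereference with a diagnostic about the invalid iterator.

```cpp
#include <map>
#include <string>

// Returns the value stored under "a".
// With buggy == true, the value is read through an iterator that was
// invalidated by erase(), which is a deliberate bug for demonstration.
int erase_then_read(bool buggy) {
    std::map<std::string, int> m{{"a", 1}, {"b", 2}};
    auto it = m.find("a");
    int value = it->second;  // valid read, taken before the erase
    m.erase(it);             // invalidates `it`
    if (buggy) {
        return it->second;   // stale iterator: undefined behaviour
    }
    return value;
}
```

Only the erase_then_read(false) path is safe to run in an ordinary build. Note that debug mode changes the layout of standard containers, so all translation units that exchange containers have to be compiled with the same setting, whereas -D_GLIBCXX_ASSERTIONS adds lighter checks (such as bounds checks in operator[] of sequence containers) without changing the layout.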