GDB 'bt' doesn't print a usable backtrace when debugging a crash

My multi-threaded application crashes at random, and I am trying to debug it.

However, when I use the 'bt' command I get the following output (instead of a trace):

#0 0x9f665582 in ?? ()

I do not know what is causing this. So to look into the details, I tried to print the instruction at the current instruction pointer ($eip):

x /i $eip
0x9f665582:    mov   (%esi),%edi

Now, when I try to check the contents of the memory (512 bytes up and down) at both "%esi" and "%edi", I get the following in all cases:

<Address 0xblabla out of bounds>

It looks like destination/source addresses are corrupt, right?

Also, when I run the 'list' command I get the source of the parent thread, which does nothing but run in a loop without doing any work. I doubt the parent thread is causing this crash. However, it could be that some other thread is corrupting the stack frame of the parent thread. But how would I find which data structure or thread is doing it?

Solution

The absence of a symbol name in the single frame (#0) in the backtrace is consistent with the premise that the %eip value is not within valid code (though the absence of a symbol table can also cause that). If 0x9f665582 is not really within a function, then the data which happens to be there is not necessarily intended to be an instruction, in which case we wouldn't expect %esi to necessarily contain a mapped address. In short, the value of %eip is more likely the issue than the value of %esi.

There are multiple ways that %eip can be set to a bogus value. Stack corruption (mentioned here already) is one way. If something like a buffer overflow clobbers a return address stored on the stack, a return instruction will branch to the value at the clobbered location rather than the correct return address.
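To make the return-address case concrete, here is a minimal sketch in C that models a frame as a plain byte array rather than smashing a real stack (which would be undefined behavior). The layout (a 16-byte buffer, then the saved %ebp, then the return address) and all names are assumptions for illustration; a real compiler may lay the frame out differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 32-bit frame, low to high address:
   16-byte local buffer, saved %ebp, return address. */
enum { BUF_SIZE = 16, SAVED_EBP_OFF = 16, RET_ADDR_OFF = 20, FRAME_SIZE = 24 };

typedef struct {
    unsigned char bytes[FRAME_SIZE];
} frame_model;

/* Set up the frame the way a call plus function prologue would. */
void frame_init(frame_model *f, uint32_t saved_ebp, uint32_t ret_addr) {
    memset(f->bytes, 0, FRAME_SIZE);
    memcpy(f->bytes + SAVED_EBP_OFF, &saved_ebp, sizeof saved_ebp);
    memcpy(f->bytes + RET_ADDR_OFF, &ret_addr, sizeof ret_addr);
}

/* Copy into the local buffer without checking against BUF_SIZE,
   as an unbounded strcpy/memcpy would. The assert only keeps the
   model itself inside its own array. */
void unchecked_copy(frame_model *f, const void *src, size_t n) {
    assert(n <= FRAME_SIZE);
    memcpy(f->bytes, src, n);
}

/* The value a ret instruction would load into %eip. */
uint32_t frame_return_address(const frame_model *f) {
    uint32_t ret;
    memcpy(&ret, f->bytes + RET_ADDR_OFF, sizeof ret);
    return ret;
}
```

Copying 24 bytes of 0x41 into the 16-byte buffer leaves 0x41414141 where the return address used to be, so the "return" would branch there. A repeating byte pattern in %eip is a common fingerprint of this kind of overflow.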

Another way %eip can be set to a bogus value is a function pointer dereference through a pointer with a bogus value. One example of how that might occur is a stale reference to a struct containing function pointers. If the memory for such a struct is freed and then overwritten by the rightful (new) owner of that memory, attempts to use that struct will be problematic.
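The stale-struct scenario can be sketched the same way. The code below is an illustrative model, not a real use-after-free: a fixed chunk of bytes stands in for heap memory that gets handed to a new owner, and the names (handler_ops, chunk, and so on) are hypothetical. The check is done bytewise so the example never actually loads or calls a garbage pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical struct of callbacks, the kind of object a stale
   reference might still point at. */
typedef struct {
    void (*on_event)(int);
} handler_ops;

/* Stand-in for one heap chunk that the allocator reuses after free(). */
typedef struct {
    unsigned char bytes[sizeof(handler_ops)];
} chunk;

static void real_handler(int code) { (void)code; }

/* First owner: stores a valid handler_ops in the chunk. */
void first_owner_store(chunk *c) {
    handler_ops ops = { real_handler };
    assert(c != NULL);
    memcpy(c->bytes, &ops, sizeof ops);
}

/* New owner after the "free": uses the chunk as a plain byte buffer. */
void new_owner_overwrite(chunk *c) {
    assert(c != NULL);
    memset(c->bytes, 0x41, sizeof c->bytes);
}

/* Returns 1 if the bytes where on_event lives still encode real_handler. */
int callback_intact(const chunk *c) {
    void (*expected)(int) = real_handler;
    assert(c != NULL);
    return memcmp(c->bytes + offsetof(handler_ops, on_event),
                  &expected, sizeof expected) == 0;
}
```

After first_owner_store the callback is intact; after new_owner_overwrite it is not, and a stale caller doing ops->on_event(...) at that point would branch to whatever the new owner's data happens to spell out as an address.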

For the purpose of understanding the details of this crash, I'd say there are two things to focus on. One is the values of the various registers; the other is the contents of the stack. One way to find valid return addresses on the stack is to examine ranges of the stack with something like x/32a (the /a leads gdb to look for names corresponding to the addresses). A return address is typically rendered as the function name plus an offset; if you disassemble that function, and the instruction immediately before the one whose address is on the stack is a call instruction, that makes it a return address. It is possible, if tedious, to reconstruct a partial backtrace by matching up return address values on the stack; this is easier if the code uses %ebp as a frame pointer rather than just another register (examination of disassembly can help determine that).
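The "is the preceding instruction a call?" test is normally done by eye on the disassembly, but the idea can be sketched in C as a byte-level heuristic. This is a rough sketch that assumes 32-bit x86 and covers only a few common encodings (call rel32, call through a register or a plain register-indirect operand, and call through an absolute address); the function name is made up for illustration. It can give false positives, since those byte values may also appear inside other instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* In an FF-group instruction, ModRM reg field == 2 means "call r/m32". */
static int is_call_modrm(uint8_t modrm) {
    return ((modrm >> 3) & 7) == 2;
}

/* Returns 1 if the bytes ending just before code[off] look like a call,
   which would make the address of code[off] a plausible return address. */
int preceded_by_call(const uint8_t *code, size_t len, size_t off) {
    assert(code != NULL && off <= len);

    /* call rel32: opcode E8 followed by a 4-byte displacement. */
    if (off >= 5 && code[off - 5] == 0xE8)
        return 1;

    /* call *%reg or call *(%reg): FF plus a ModRM byte, no SIB or displacement. */
    if (off >= 2 && code[off - 2] == 0xFF) {
        uint8_t m = code[off - 1];
        unsigned mod = m >> 6, rm = m & 7;
        if (is_call_modrm(m) && (mod == 3 || (mod == 0 && rm != 4 && rm != 5)))
            return 1;
    }

    /* call *abs32: FF 15 followed by a 4-byte address. */
    if (off >= 6 && code[off - 6] == 0xFF && code[off - 5] == 0x15)
        return 1;

    return 0;
}
```

In practice gdb does the decoding for you: disassemble the function that x/32a named and look at the instruction just above the address found on the stack.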

The value of %esp at the time of the crash might tell you which part of the stack was most recently active, although that can be muddied in any number of ways. One thing to keep in mind is that the "instruction" at which the crash occurred might not be the initial bogus value of %eip, but rather just the first "instruction" which attempted to dereference an unmapped address. (I'm quoting "instruction" because depending on where exactly %eip landed, the contents of that memory might not even be legitimate code.) Various things can go wrong when branching into the weeds, including an illegal instruction, but in this crash it was an attempt to dereference an unmapped address.

It would seem that the immediate challenge in this situation is finding a coherent frame of reference for something that recently behaved as intended. A reconstructed partial backtrace based on legitimate return addresses seems the most likely candidate for that.
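If the code does use %ebp as a frame pointer, the manual reconstruction amounts to following a linked list: the word at (%ebp) is the caller's saved %ebp and the word at 4(%ebp) is the return address. Here is a small sketch of that walk over a snapshot of stack memory. It assumes the conventional 32-bit x86 frame layout, and the function name and parameters are invented for the example; with corrupted frames the chain will break early, which is why the sanity checks are there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walks a saved-%ebp chain through a snapshot of stack memory.
   stack[i] models the 32-bit word at address base + 4*i.
   Stores up to max_out return addresses in out; returns how many. */
size_t walk_ebp_chain(const uint32_t *stack, size_t words, uint32_t base,
                      uint32_t ebp, uint32_t *out, size_t max_out) {
    size_t found = 0;
    assert(stack != NULL && out != NULL);

    while (found < max_out) {
        /* The frame must be word-aligned and inside the snapshot. */
        if (ebp < base || (ebp - base) % 4 != 0)
            break;
        size_t idx = (ebp - base) / 4;
        if (idx + 1 >= words)
            break;

        out[found++] = stack[idx + 1];   /* return address at 4(%ebp) */

        uint32_t next = stack[idx];      /* caller's %ebp at 0(%ebp) */
        if (next <= ebp)                 /* callers live at higher addresses */
            break;
        ebp = next;
    }
    return found;
}
```

Each address this produces is still only a candidate; checking that it is preceded by a call instruction is what confirms it as a real return address.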

Happy hunting!