c++ segmentation-fault gdb stack-trace coredump

Analyzing a core dump with a corrupted stack

I am currently trying to debug a core dump from my C++ app. The customer has reported a SEGFAULT core with the following thread list:

...Other threads go above here
  3 Thread 0xf73a2b70 (LWP 2120)  0x006fa430 in __kernel_vsyscall ()
  2 Thread 0x2291b70 (LWP 2212)  0x006fa430 in __kernel_vsyscall ()
* 1 Thread 0x218fb70 (LWP 2210)  0x00000000 in ?? ()

The thing that puzzles me is that the thread that crashed points to 0x00000000. If I try to inspect the backtrace, I get:

Thread 1 (Thread 0x1eeeb70 (LWP 27156)):
#0  0x00000000 in ?? ()
#1  0x00281da7 in SomeClass1::_someKnownMethod1 (this=..., elem=...) at path_to_cpp_file:line_number
#2  0x0028484d in SomeClass2::_someKnownMethod2 (this=..., stream=..., stanza=...) at path_to_cpp_file:line_number
#3  0x002958b2 in SomeClass3::_someKnownMethod3 (this=..., stream=..., elem=...) at path_to_cpp_file:line_number

I apologize for the redaction; it is a limitation of the NDA.

Obviously, the top frame is unknown. My first guess was that the PC register got corrupted by some stack overwrite.

I have tried reproducing the issue in my local deployment by supplying the same call that was seen in Frame #1, but the crash never happened.

It is well known that these cores are very difficult to debug, but does anyone have some hints on what to try?

Update

   0x00281d8b <+171>:   mov    edx,DWORD PTR [ebp+0x8]
   0x00281d8e <+174>:   mov    ecx,DWORD PTR [ebp+0xc]
   0x00281d91 <+177>:   mov    eax,DWORD PTR [edx+0x8]
   0x00281d94 <+180>:   mov    edx,DWORD PTR [eax]
   0x00281d96 <+182>:   mov    DWORD PTR [esp+0x8],ecx
   0x00281d9a <+186>:   mov    ecx,DWORD PTR [ebp+0x8]
   0x00281d9d <+189>:   mov    DWORD PTR [esp],eax
   0x00281da0 <+192>:   mov    DWORD PTR [esp+0x4],ecx
   0x00281da4 <+196>:   call   DWORD PTR [edx+0x14]
=> 0x00281da7 <+199>:   mov    ebx,DWORD PTR [ebp-0xc]
   0x00281daa <+202>:   mov    esi,DWORD PTR [ebp-0x8]
   0x00281dad <+205>:   mov    edi,DWORD PTR [ebp-0x4]
   0x00281db0 <+208>:   mov    esp,ebp
   0x00281db2 <+210>:   pop    ebp
   0x00281db3 <+211>:   ret
   0x00281db4 <+212>:   lea    esi,[esi+eiz*1+0x0]

... should have been the one from Frame #0, but from the disassembly this makes little sense. It looks as if the program crashed while returning from Frame #1, but then why am I seeing the invalid Frame #0? Or does this frame teardown part belong to the function onPacket?

Update #2:

(gdb) p/x $edx
$5 = 0x1deb664
(gdb) print _listener
$6 = (jax::MyClass &) @0xf6dbf6c4: {_vptr.MyClass= 0x1deb664}

Solution

If frame 1 does not make sense at a source level, you might try looking at disassembly of frame 1. After selecting that frame, disass $pc should show you the disassembly for the entire function, with => to indicate the return address (the instruction immediately after the call to frame 0).

In the case of a null function pointer dereference, the instruction for the call to frame 0 might involve a simple register dereference, in which case you'd want to understand how that register obtained the null value. In some cases including /m in a disass command can be helpful, although it can cause confusion because of the distinction between instruction boundaries and source line boundaries. Omitting /m is more likely to display a meaningful return address.

The => in the updated disassembly (without /m) makes sense. In any frame aside from frame 0, the pc value (what the => points at in the disassembly) indicates the instruction which will execute when the next lowest numbered frame returns (which, due to the crash, did not occur in this case). The pc value in frame 1 is not the value of the pc register at the time of the crash, but rather the saved pc value pushed on the stack by the call instruction. One way to see that is to compare output from x/a $sp in frame 0 to x/i $pc in frame 1.
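The point that an outer frame's pc is really the saved return address can be illustrated with a small sketch. It relies on `__builtin_return_address`, which is a GCC/Clang extension rather than standard C++, and the function names `savedReturnAddress` and `callerFrame` are hypothetical, chosen only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical callee: reports the address it will return to, i.e. the
// value that the call instruction pushed on the stack.
// __builtin_return_address is a GCC/Clang extension, not standard C++.
__attribute__((noinline)) std::uintptr_t savedReturnAddress() {
    return reinterpret_cast<std::uintptr_t>(__builtin_return_address(0));
}

// Hypothetical caller, playing the role of frame 1 in the backtrace.
__attribute__((noinline)) std::uintptr_t callerFrame() {
    // The volatile store after the call keeps it from becoming a tail
    // call, so the reported address is the instruction in callerFrame
    // that follows the call.
    volatile std::uintptr_t saved = savedReturnAddress();
    return saved;
}
```

Calling `callerFrame()` twice should report the same address both times, because the call site is the same. That address corresponds to what gdb marks with => in the caller, not to anything the callee itself was executing.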

One way to interpret this disassembly is that eax points to some object, edx holds that object's vtable pointer (loaded by mov edx,DWORD PTR [eax]), and [edx+0x14] is a slot in that vtable. One way a vtable slot might wind up reading as a null pointer is a memory allocation issue: a stale reference to a chunk of memory which has been deallocated and subsequently overwritten by its rightful owner (the next piece of code to allocate that chunk). If any of that is applicable here, it can work either way (the code in frame 1 might be the culprit, or it might be the victim). There are other reasons memory might be overwritten with incorrect contents, but double allocation might be a good place to start.

It probably makes sense to examine the contents of that object in frame 1, along with the vtable that edx points to, to see if there are any other anomalies besides what could be an incorrect vtable. Both the print command and the x command (within gdb) can be useful for this. My best guess about which object is involved, based on disass/m output (at this writing, visible only in the edit history of the question), is _listener, and the output in Update #2 is consistent with that: $edx and _vptr.MyClass are both 0x1deb664. It would still be best to confirm that by tracing how eax and edx are loaded in the instructions leading up to the call.
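To make the instruction sequence before the crash easier to picture, here is a minimal sketch that models it with a hand-rolled table of function pointers instead of real virtual dispatch. Every name in it (`Listener`, `VTable`, `onPacketImpl`, `dispatch`, and so on) is hypothetical and not taken from the redacted code. It assumes the call goes through the sixth slot, which sits at byte offset 0x14 when pointers are 4 bytes wide, as in the 32-bit core.

```cpp
#include <cassert>

// Hypothetical stand-ins for the redacted classes.
struct Listener;
using Slot = int (*)(Listener*, int);

// Hand-rolled "vtable" with six slots. Slot 5 is at byte offset
// 5 * sizeof(Slot), which is 0x14 on a 32-bit target.
struct VTable {
    Slot slots[6];
};

struct Listener {
    const VTable* vptr;  // first word of the object, like _vptr.MyClass
    int id;
};

int unusedSlot(Listener*, int) { return 0; }
int onPacketImpl(Listener* self, int elem) { return self->id + elem; }

// A healthy table, and one whose slots are all null. The zeroed table
// stands in for memory that was overwritten after the vtable pointer
// was stored.
const VTable kListenerTable = {{unusedSlot, unusedSlot, unusedSlot,
                                unusedSlot, unusedSlot, onPacketImpl}};
const VTable kZeroedTable = {};

// Mirrors "mov edx,[eax]; call [edx+0x14]", but checks the target first.
// Returns false where the unchecked code would jump to address 0.
bool dispatch(Listener* obj, int elem, int* result) {
    const VTable* table = obj->vptr;  // mov edx, DWORD PTR [eax]
    if (table == nullptr) return false;
    Slot target = table->slots[5];    // DWORD PTR [edx+0x14]
    if (target == nullptr) return false;
    *result = target(obj, elem);
    return true;
}
```

With a `Listener` whose `vptr` is `&kListenerTable`, `dispatch` calls `onPacketImpl`. With `&kZeroedTable`, the table pointer itself is non-null, but the slot is, which matches a crash at 0x00000000 while edx still looks like a plausible address.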