Search code examples
node.jslinuxgdbv8lldb

Backtrace doesn't work in GDB but works in LLDB


I'm trying to debug a nodejs core dump as an experimentation. My main target was simulating a production problem and seeing V8 stack trace in a core dump. Therefore I intentionally wrote a piece of code in NodeJS that performs constant string concatenation in a forked NodeJS process, and at the end due to memory limits, it's been aborted and the core dump is taken.

So far so good, However after installing GDB and providing the dump file, it's loaded but when I execute bt, it only shows the following output.

$ gdb /usr/local/bin/node -c core.f8f32091796c.node.1700129772.28
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-alpine-linux-musl".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/bin/node...
[New LWP 28]
[New LWP 30]
[New LWP 29]
[New LWP 33]
[New LWP 31]
[New LWP 32]
[New LWP 34]
Core was generated by `/usr/local/bin/node template.js'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fa1a8f5a3f2 in setjmp () from /lib/ld-musl-x86_64.so.1
[Current thread is 1 (LWP 28)]
(gdb) bt
#0  0x00007fa1a8f5a3f2 in setjmp () from /lib/ld-musl-x86_64.so.1
#1  0x00007fa1a8f5a54d in raise () from /lib/ld-musl-x86_64.so.1
#2  0x00007fa1a8f5b9a9 in ?? () from /lib/ld-musl-x86_64.so.1
#3  0x00007fa1a8faae98 in ?? () from /lib/ld-musl-x86_64.so.1
#4  0x0000000000000000 in ?? ()
(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x00007fa1a8e317d0  0x00007fa1a8ebab61  Yes (*)     /usr/lib/libstdc++.so.6
0x00007fa1a8d5c2f0  0x00007fa1a8d6c551  Yes (*)     /usr/lib/libgcc_s.so.1
0x00007fa1a8f29070  0x00007fa1a8f70761  Yes (*)     /lib/ld-musl-x86_64.so.1
(*): Shared library is missing debugging information.
(gdb) info frame
Stack level 0, frame at 0x7ffce26c2038:
 rip = 0x7fa1a8f5a3f2 in setjmp; saved rip = 0x7fa1a8f5a54d
 called by frame at 0x7ffce26c2050
 Arglist at 0x7ffce26c2030, args:
 Locals at 0x7ffce26c2030, Previous frame's sp is 0x7ffce26c2040
 Saved registers:
  rip at 0x7ffce26c2038

If I load the same dump file with LLDB it works as expected and shows the following output.

(lldb) target create "/usr/local/bin/node" --core "core.f8f32091796c.node.1700129772.28"
Core file '/dump/core.f8f32091796c.node.1700129772.28' (x86_64) was loaded.

(lldb) bt
* thread #1, name = 'node', stop reason = signal SIGABRT
  * frame #0: 0x00007fa1a8f5a3f2 ld-musl-x86_64.so.1`__setjmp + 118
    frame #1: 0x00007fa1a8f5a54d ld-musl-x86_64.so.1`raise + 64
    frame #2: 0x00007fa1a8f30f25 ld-musl-x86_64.so.1`abort + 14
    frame #3: 0x00005641a6ef5e55 node`node::Abort() + 37
    frame #4: 0x00005641a6e00d27 node`node::OnFatalError(char const*, char const*) + 283
    frame #5: 0x00005641a70ed0e2 node`v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) + 82
    frame #6: 0x00005641a70ed46f node`v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) + 847
    frame #7: 0x00005641a72bf365 node`v8::internal::Heap::FatalProcessOutOfMemory(char const*) + 21
...
...

Both are running on the same container and here is the version information of all the tools used.

$ cat /etc/*release
3.14.3
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.14.3
PRETTY_NAME="Alpine Linux v3.14"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://bugs.alpinelinux.org/"

$ lldb --version
lldb version 11.1.0

$ gdb --version
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ node --version
v16.13.0

I want to make it run with GDB, I know there is llnode. Because of that, I went down the rabbit hole and explored all the related concepts like stack unwinding, .eh_frame, debug frames etc. even compiled and installed musl-dbg from alpine repository but no success. Gdb complains about CRC Mismatch. But I don't think its the root cause, because LLDB works without installing any debug packages.

Update 1

I have upgraded the container to node:20.10-alpine3.18 and installed latest GDB version(13.1) available on that Alpine version. to make sure it wasn't a bug in GDB 10.2. Reproduced a new core dump but nothing has changed, still GDB can't show the backtrace but LLDB can.

Update 2

Since I have upgraded the alpine version, I thought I could find the matching musl-dbg package to get rid of CRC mismatch error, and gave it a last chance and installed the matching package finally. Surprise It's working up to some specific frame but can't show some frames like LLDB. Here is some part of the working GDB output.

(gdb) bt
#0  __restore_sigs (set=set@entry=0x7fffdc487220) at ./arch/x86_64/syscall_arch.h:40
#1  0x00007ff282f41e5c in raise (sig=sig@entry=6) at src/signal/raise.c:11
#2  0x00007ff282f14fa8 in abort () at src/exit/abort.c:11
#3  0x000055d41a8c3e86 in node::Abort() ()
#4  0x000055d41a7865cd in node::OOMErrorHandler(char const*, v8::OOMDetails const&) ()
#5  0x000055d41aaf8010 in v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, v8::OOMDetails const&) ()
#6  0x000055d41aaf82ff in v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) ()
#7  0x000055d41ad28917 in v8::internal::Heap::FatalProcessOutOfMemory(char const*) ()
#8  0x000055d41ad41d61 in v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) ()
...
...

It's good to see it works but the main question still remains, why can LLDB show all the frames without having a debug symbol but GDB can't?


Solution

  • I went down the rabbit hole and explored why it wasn't working. Basically, GDB was using a prologue based unwinding strategy but it was wrongly calculating the frames probably due to setjmp function written purely in assembly in musl-libc. At the end, I wrote a custom unwinder for musl frames and it worked. This is the code, also I have written a blog series to explain the problem and solution, unfortunately it's not in English.

    import re
    import gdb
    from gdb.unwinder import Unwinder
    
    
    def debug(pc, current_rsp, offset, addr, frame_id, func):
        print('=============debug===========')
        print('{:<20}:{:<8}'.format('function',func))
        print('{:<20}:{:<8}'.format('pc', str(pc)))
        print('{:<20}:{:<8}'.format('current_rsp', str(current_rsp)))
        print('{:<20}:{:<8}'.format('offset', str(offset)))
        print('{:<20}:{:<8}'.format('return address', hex(addr)))
        print('{:<20}:{:<8}'.format('frame_id', str(frame_id)))
    
    u64_ptr = gdb.lookup_type('unsigned long long').pointer()
    
    class FrameID:
        def __init__(self, sp, pc):
            self.sp = sp
            self.pc = pc
    
        def __str__(self):
            return f'sp: {self.sp}, pc: {self.pc}'
    
    class MuslUnwinder(Unwinder):
        def __init__(self):
            super().__init__("musl_unwinder")
    
        def is_musl_frame(self,pc):
            obj = gdb.execute("info symbol 0x%x" % pc, False, True)
            return "musl" in obj
    
        def dereference(self,adr):
            deref = gdb.parse_and_eval("0x%x" % adr).cast(u64_ptr).dereference()
            return deref
    
        def __call__(self, pending_frame):
            pc = pending_frame.read_register("pc")
            if not self.is_musl_frame(pc):
                return None
            asm = gdb.execute("disassemble 0x%x" % pc, False, True)
            lines = asm.splitlines()
            func = None
            args_bytes = 0
            locals_bytes = 0
            rbp_bytes = 0
    
            for line in lines:
                m = re.match('Dump of assembler code for function (.*):', line)
                if m:
                    func = m.group(1)
                elif re.match('.*push[ ]*%', line):
                    args_bytes += 8 
                    if "rbp" in line:
                        rbp_bytes += 8
                elif m := re.match('.*sub[ ]*\\$0x([A-Fa-f0-9]+),%rsp', line):
                    locals_bytes = int(m.group(1), 16)
                    break
    
            offset = locals_bytes + args_bytes
            current_rsp = pending_frame.read_register("rsp")
            current_rbp = pending_frame.read_register("rbp")
            rsp = current_rsp + offset + 8
            return_addr = self.dereference(current_rsp + offset)
            frame_id = FrameID(rsp, pc)
    
            unwind_info = pending_frame.create_unwind_info(frame_id)
            unwind_info.add_saved_register("rsp", rsp)
            unwind_info.add_saved_register("rip", return_addr)
    
            if rbp_bytes > 0:
                saved_rbp = self.dereference(current_rsp+locals_bytes+rbp_bytes)
                unwind_info.add_saved_register("rbp", saved_rbp)
            else:
                unwind_info.add_saved_register("rbp", current_rbp)
    
            if gdb.parameter("verbose"):
                debug(pc, current_rsp, offset, return_addr, frame_id, func)
    
            return unwind_info
    
    gdb.execute('set disassembly-flavor att')
    gdb.unwinder.register_unwinder(None, MuslUnwinder(), replace=True)
    gdb.invalidate_cached_frames()