Tags: gdb, shared-libraries, coredump

How to find the cause of this segmentation fault using gdb and a core dump file? (Limitations of GDB)


I know I can use a core dump file to figure out where a program goes wrong. However, there are some bugs where, even after you debug with the core file, you still don't know why the program went wrong. So my point is that the range of bugs that gdb and core files can help you debug is limited. How limited is it?

For example, I wrote the following code (libfoo.c):

#include <stdio.h>
#include <stdlib.h>
void foo(void);
int main()
{
    puts("This is a mis-compiled runnable shared library");
    return 0;
}
void foo()
{
    puts("This is the shared function");
}

The following is the Makefile:

.PHONY : all clean
all : libfoo.c
	# Wrong on purpose: this is the command that produces the crashing libfoo.so
	gcc -g -Wall -shared -fPIC -Wl,-soname,$(basename $^).so.1 -o $(basename $^).so.1.0.0 $^
	# The correct compiling command should be:
	# gcc -g -Wall -shared -fPIC -pie -Wl,--export-dynamic,-soname,$(basename $^).so.1 -o $(basename $^).so.1.0.0 $^
	sudo ldconfig $(CURDIR)                          # this will set up the soname link
	ln -s $(basename $^).so.1.0.0 $(basename $^).so  # this will set up the linker name link
clean :
	-rm libfoo.s*; sudo ldconfig  # roll back

When I ran ./libfoo.so, I got a segmentation fault; this was because I had compiled the runnable shared library the wrong way. But I wanted to know exactly what was causing the segmentation fault, so I ran gdb ./libfoo.so core.8326, then bt, and got the following:

[xhan@localhost Desktop]$ gdb ./libfoo.so core.8326 
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/xiaohan/Desktop/libfoo.so.1.0.0...done.

warning: core file may not match specified executable file.
[New LWP 8326]
Core was generated by `./libfoo.so'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000000001 in ?? ()
(gdb) bt
#0  0x0000000000000001 in ?? ()
#1  0x00007ffd29cd13b4 in ?? ()
#2  0x0000000000000000 in ?? ()
(gdb) quit

But I still don't know what caused the segmentation fault. Debugging the core file gives me no clue that the segmentation fault was caused by the wrong compile command.

Can anyone help me debug this? Or can anyone tell me which kinds of bugs are impossible to debug even with gdb and a core file? Answers that address only one of these questions are also welcome.

Thanks!


IMPORTANT ASSUMPTIONS I AM HOLDING:

  1. Some may ask why I want to make a shared library runnable. I do this because I want to build a shared library that, like /lib64/ld-2.17.so, can also be run directly.
  2. Of course, you can't rely on gdb to tell you the cause of every bug you make. For example, if you simply chmod +x a non-executable file, run it, and get an error (usually without a core dump), then trying to debug that with gdb is somewhat "crazy". However, once an "executable" can be loaded and dumps a core file at run time, you should be able to use gdb to debug it and, furthermore, FIND CLUES about why the program goes wrong. Yet with my ./libfoo.so, I am totally lost.

Solution

  • the scope of the bugs that gdb and core files can help you to debug is limited.

    Correct: there are several large classes of bugs for which a core dump provides little help. The most common (in my experience) are:

    1. Issues that happen at process startup (such as the example you showed).

      GDB needs the dynamic loader's cooperation to learn where the various ELF images are mmap'ed in the process address space.

      When the crash happens in the dynamic loader itself, or before the dynamic loader has had a chance to tell GDB where things are, you end up with a very confusing picture.

    2. Various heap corruption bugs.

      Usually you can tell that heap corruption is the likely problem (e.g. a crash inside malloc or free is usually a sign of it), but that tells you very little about the root cause.

      Fortunately, tools like Valgrind and Address Sanitizer can often point you straight at the problem.

    3. Various stack corruption (stack buffer overflow) bugs.

      GDB uses the contents of the current stack to tell you how you got to the function you are in (the backtrace).

      But if you overwrite stack memory with garbage, the record of how you got to where you are is lost. And if you corrupt the stack and then call through a "garbage" function pointer, you can end up with a core dump from which you can tell neither where you are nor how you got there.

    4. Various "logical" bugs.

      For example, suppose you have a tree data structure and a recursive procedure to visit its nodes. If your tree is not a proper tree and has a cycle in it, your visit procedure will run out of stack and crash.

      But looking at the crash tells you nothing about where the tree ceased to be a tree and turned into a graph.

    5. Data races.

      You may be iterating over the elements of a std::vector and crash. Examining the vector shows you that it is no longer in a valid state.

      That often happens when some other thread modifies the vector (or any other data structure) from under you.

      Again, the crash stack trace tells you very little about where the bug actually is.
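For the startup case (item 1), one check that might have pointed at the build command in the question is whether the file requests a program interpreter at all. As far as we know, the kernel starts the dynamic loader only when an ELF file has a PT_INTERP program header, and a plain gcc -shared link does not add one, so in that case the loader never gets the chance to tell GDB anything. readelf -l reports this as a "Requesting program interpreter" line; the sketch below does the same check by hand. It is our own illustration, not part of the answer: inspect_elf and struct elf_info are hypothetical names, and the code assumes a 64-bit ELF file in the host's byte order, read with glibc's <elf.h>.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result type: what we learned from the ELF header and program headers. */
struct elf_info {
    unsigned e_type;   /* ET_EXEC, ET_DYN, ... */
    int has_interp;    /* 1 if a PT_INTERP segment is present */
    char interp[256];  /* requested interpreter path, if any */
};

/* Returns 0 on success, -1 on I/O error or if the file is not ELF64.
   Assumes host byte order and does not handle the e_phnum == PN_XNUM case. */
int inspect_elf(const char *path, struct elf_info *out)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    Elf64_Ehdr eh;
    if (fread(&eh, sizeof eh, 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64) {
        fclose(f);
        return -1;
    }

    memset(out, 0, sizeof *out);
    out->e_type = eh.e_type;

    /* Walk the program header table looking for PT_INTERP. */
    for (unsigned i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        long off = (long)(eh.e_phoff + (Elf64_Off)i * eh.e_phentsize);
        if (fseek(f, off, SEEK_SET) != 0 || fread(&ph, sizeof ph, 1, f) != 1) {
            fclose(f);
            return -1;
        }
        if (ph.p_type == PT_INTERP) {
            /* The segment holds a NUL-terminated path; copy at most interp[] - 1 bytes. */
            size_t max = sizeof out->interp - 1;
            size_t n = ph.p_filesz < max ? (size_t)ph.p_filesz : max;
            if (fseek(f, (long)ph.p_offset, SEEK_SET) != 0 ||
                fread(out->interp, 1, n, f) != n) {
                fclose(f);
                return -1;
            }
            out->interp[n] = '\0';
            out->has_interp = 1;
        }
    }

    fclose(f);
    return 0;
}
```

Run against the question's libfoo.so.1.0.0, we would expect has_interp to be 0 (we have not run it on that file); for an ordinary dynamically linked program it should be 1, with interp set to something like /lib64/ld-linux-x86-64.so.2.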
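For heap corruption (item 2), the goal is to catch the overrun closer to where it happens than the eventual crash inside malloc or free. Valgrind and AddressSanitizer do this properly; the sketch below is only a crude, hand-rolled illustration of the underlying idea: guard bytes placed after each block and checked when the block is freed. guarded_malloc, guarded_check and guarded_free are hypothetical helpers, not a real library API, and they detect only overruns past the end of a block, and only when that block is checked.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_BYTES 16
#define GUARD_FILL  0xAB
/* Header sized to max_align_t so the returned pointer keeps malloc's alignment. */
#define HDR_SIZE    sizeof(max_align_t)

/* Layout: [header holding n][n user bytes][GUARD_BYTES bytes of GUARD_FILL] */
void *guarded_malloc(size_t n)
{
    if (n > SIZE_MAX - HDR_SIZE - GUARD_BYTES)
        return NULL;
    unsigned char *base = malloc(HDR_SIZE + n + GUARD_BYTES);
    if (base == NULL)
        return NULL;
    memcpy(base, &n, sizeof n);                       /* remember the requested size */
    memset(base + HDR_SIZE + n, GUARD_FILL, GUARD_BYTES);
    return base + HDR_SIZE;
}

/* Returns 1 if the guard bytes after the block are intact, 0 if they were overwritten. */
int guarded_check(const void *p)
{
    const unsigned char *base = (const unsigned char *)p - HDR_SIZE;
    size_t n;
    memcpy(&n, base, sizeof n);
    for (size_t i = 0; i < GUARD_BYTES; i++)
        if (base[HDR_SIZE + n + i] != GUARD_FILL)
            return 0;
    return 1;
}

/* Aborts with the caller's location if the block was overrun, else frees it. */
void guarded_free(void *p, const char *file, int line)
{
    if (p == NULL)
        return;
    if (!guarded_check(p)) {
        fprintf(stderr, "%s:%d: heap overrun detected past block %p\n", file, line, p);
        abort();
    }
    free((unsigned char *)p - HDR_SIZE);
}
```

Because guarded_free takes the caller's file and line, a macro such as #define GFREE(p) guarded_free((p), __FILE__, __LINE__) can at least name the free that discovered the damage; unlike AddressSanitizer, it cannot name the instruction that did the writing.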
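For the tree example in item 4, one common way to find where the tree stopped being a tree (our suggestion, not something the answer prescribes) is to check the invariant after every mutation in debug builds, so the failure is reported at the offending update instead of at the eventual stack overflow. The sketch below assumes a hypothetical binary node type with a scratch mark field; tree_is_valid returns false if any node is reachable by two different paths, which covers both cycles and shared subtrees.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node type for illustration only. */
struct node {
    struct node *left, *right;
    bool mark;  /* scratch flag: false except while tree_is_valid() runs */
};

/* Marks every node reachable from n; fails if some node is reached twice. */
static bool mark_reachable(struct node *n)
{
    if (n == NULL)
        return true;
    if (n->mark)
        return false;  /* second path to the same node: cycle or shared subtree */
    n->mark = true;
    return mark_reachable(n->left) && mark_reachable(n->right);
}

/* Clears the marks set above; stops at unmarked nodes, so it terminates even on a cycle. */
static void clear_marks(struct node *n)
{
    if (n == NULL || !n->mark)
        return;
    n->mark = false;
    clear_marks(n->left);
    clear_marks(n->right);
}

/* True if the structure rooted at root is a proper tree. */
bool tree_is_valid(struct node *root)
{
    bool ok = mark_reachable(root);
    clear_marks(root);
    return ok;
}
```

In a debug build one might write assert(tree_is_valid(root)); after each insert or delete. The check is O(n) per call and itself recursive, so it is normally compiled out of release builds.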