Tags: c, gcc, ld, dynamic-linking, dynamic-loading

Loading a dynamic library at run-time yields inconsistent and unexpected results, missing symbols and empty PLT entries. Why?


I've been fighting with this problem for quite some time, and I've been unable to find a solution or even an explanation for it. Sorry if the question is long, but please bear with me: I want to make it 100% clear, in the hope that someone more experienced than me will be able to figure it out.

I'm keeping the C syntax highlight on for all snippets because it makes them a little bit clearer even if not really correct.

What I want to do

I have a C program which uses some functions from a dynamic library (libzip). Here it is boiled down to a minimal reproducible example (it basically does nothing, but it works just fine):

#include <zip.h>

int main(void) {
    int err;
    zip_t *myzip;

    myzip = zip_open("myzip.zip", ZIP_CREATE | ZIP_TRUNCATE, &err);
    if (myzip == NULL)
        return 1;

    zip_close(myzip);

    return 0;
}

Normally, to compile it, I would simply do:

gcc -c prog.c
gcc -o prog prog.o -lzip

This creates, as expected, an ELF which requires libzip to run:

$ ldd prog
linux-vdso.so.1 (0x00007ffdafb53000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f81eedc7000)
/lib64/ld-linux-x86-64.so.2 (0x00007f81ef780000)
libzip.so.4 => /usr/lib/x86_64-linux-gnu/libzip.so.4 (0x00007f81ef166000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f81eebad000)

(libz is just a dependency of libzip)

What I really want to do though, is to load the library myself using dlopen(). Pretty simple task, no? Well yes, or at least I thought.

To achieve this, I should just need to call dlopen and let the loader do its job:

#include <zip.h>
#include <dlfcn.h>

int main(void) {
    void *lib;
    int err;
    zip_t *myzip;

    lib = dlopen("libzip.so", RTLD_LAZY | RTLD_GLOBAL);
    if (lib == NULL)
        return 1;

    myzip = zip_open("myzip.zip", ZIP_CREATE | ZIP_TRUNCATE, &err);
    if (myzip == NULL)
        return 1;

    zip_close(myzip);

    return 0;
}
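When dlopen() fails, the program above simply returns 1, which makes it hard to tell a missing library apart from a later failure. A minimal sketch of a more talkative loader follows. The helper name open_first_available is hypothetical, and the idea of trying several candidate names is an assumption on my part: on Debian-like systems the unversioned libzip.so symlink typically comes with the development package, while the ldd output above shows the runtime soname libzip.so.4.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: try each candidate name in order and return the
 * first handle dlopen() gives back, or NULL if none could be loaded. */
static void *open_first_available(const char *const *names, size_t count)
{
    assert(names != NULL);

    for (size_t i = 0; i < count; i++) {
        void *handle = dlopen(names[i], RTLD_LAZY | RTLD_GLOBAL);
        if (handle != NULL)
            return handle;

        /* dlerror() describes the most recent dl* failure as a string. */
        const char *msg = dlerror();
        fprintf(stderr, "dlopen(%s) failed: %s\n",
                names[i], msg != NULL ? msg : "unknown error");
    }
    return NULL;
}
```

In the program above, the dlopen() call could be replaced by a call to this helper with the candidates "libzip.so" and "libzip.so.4". On older glibc versions the program still has to be linked with -ldl.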

Of course, since I want to load the library myself, I will not link against it this time:

# Create prog.o
gcc -c prog.c

# Do a dry-run just to make sure all symbols are resolved
gcc -o /dev/null prog.o -ldl -lzip

# Now link again, this time against libdl only
gcc -o prog prog.o -ldl -Wl,--unresolved-symbols=ignore-in-object-files

The flag --unresolved-symbols=ignore-in-object-files tells ld to not worry about my prog.o having unresolved symbols at link time (I want to take care of that myself at runtime).

The problem

The above Should Just Work™, and indeed it does seem to... but I have two machines, and being the pedantic nerd I am I just thought "well, better make sure and compile it on both of them".

First machine

x86-64, Linux 4.9, Debian 9, gcc 6.3.0, ld 2.28. Here everything works as expected.

I can clearly see that the symbols are there:

$ readelf --dyn-syms prog

Symbol table '.dynsym' contains 15 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_deregisterTMCloneTab
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __libc_start_main@GLIBC_2.2.5 (2)
     3: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
===> 4: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND zip_close
     5: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND dlopen@GLIBC_2.2.5 (3)
===> 6: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND zip_open
     7: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _Jv_RegisterClasses
     8: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_registerTMCloneTable
     9: 0000000000000000     0 FUNC    WEAK   DEFAULT  UND __cxa_finalize@GLIBC_2.2.5 (2)
    10: 0000000000201040     0 NOTYPE  GLOBAL DEFAULT   25 _edata
    11: 0000000000201048     0 NOTYPE  GLOBAL DEFAULT   26 _end
    12: 0000000000201040     0 NOTYPE  GLOBAL DEFAULT   26 __bss_start
    13: 00000000000006a0     0 FUNC    GLOBAL DEFAULT   11 _init
    14: 0000000000000924     0 FUNC    GLOBAL DEFAULT   15 _fini

The PLT entries are also there as expected and look fine:

$ objdump -j .plt -M intel -d prog

Disassembly of section .plt:

00000000000006c0 <.plt>:
 6c0:   ff 35 42 09 20 00       push   QWORD PTR [rip+0x200942]        # 201008 <_GLOBAL_OFFSET_TABLE_+0x8>
 6c6:   ff 25 44 09 20 00       jmp    QWORD PTR [rip+0x200944]        # 201010 <_GLOBAL_OFFSET_TABLE_+0x10>
 6cc:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]

00000000000006d0 <zip_close@plt>:
 6d0:   ff 25 42 09 20 00       jmp    QWORD PTR [rip+0x200942]        # 201018 <zip_close>
 6d6:   68 00 00 00 00          push   0x0
 6db:   e9 e0 ff ff ff          jmp    6c0 <.plt>

00000000000006e0 <dlopen@plt>:
 6e0:   ff 25 3a 09 20 00       jmp    QWORD PTR [rip+0x20093a]        # 201020 <dlopen@GLIBC_2.2.5>
 6e6:   68 01 00 00 00          push   0x1
 6eb:   e9 d0 ff ff ff          jmp    6c0 <.plt>

00000000000006f0 <zip_open@plt>:
 6f0:   ff 25 32 09 20 00       jmp    QWORD PTR [rip+0x200932]        # 201028 <zip_open>
 6f6:   68 02 00 00 00          push   0x2
 6fb:   e9 c0 ff ff ff          jmp    6c0 <.plt>

And the program runs without any problem:

$ ./prog
$ echo $?
0

Even looking inside it with a debugger I can clearly see the symbols getting correctly resolved like any normal dynamic symbol:

0x55555555479b <main+43>                       lea    rax, [rbp - 0x14]
0x55555555479f <main+47>                       mov    rdx, rax
0x5555555547a2 <main+50>                       mov    esi, 9
0x5555555547a7 <main+55>                       lea    rdi, [rip + 0xc0] <0x7ffff7ffd948>
0x5555555547ae <main+62>                       call   zip_open@plt <0x555555554620>
 |
 v ### PLT entry:
0x555555554620 <zip_open@plt>                  jmp    qword ptr [rip + 0x200a02] <0x555555755028>
 |
 v 
0x555555554626 <zip_open@plt+6>                push   2
0x55555555462b <zip_open@plt+11>               jmp    0x5555555545f0
 |
 v ### PLT stub:
0x5555555545f0                                 push   qword ptr [rip + 0x200a12] <0x555555755008>
0x5555555545f6                                 jmp    qword ptr [rip + 0x200a14] <0x7ffff7def0d0>
 |
 v ### Symbol gets correctly resolved
0x7ffff7def0d0 <_dl_runtime_resolve_fxsave>    push   rbx
0x7ffff7def0d1 <_dl_runtime_resolve_fxsave+1>  mov    rbx, rsp
0x7ffff7def0d4 <_dl_runtime_resolve_fxsave+4>  and    rsp, 0xfffffffffffffff0
0x7ffff7def0d8 <_dl_runtime_resolve_fxsave+8>  sub    rsp, 0x240

Second machine

x86-64, Linux 4.15, Ubuntu 18.04, gcc 7.4, ld 2.30. Here, something really strange is going on.

Compilation doesn't yield any warning or error, but I do not see the symbols:

$ readelf --dyn-syms prog

Symbol table '.dynsym' contains 7 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_deregisterTMCloneTab
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __libc_start_main@GLIBC_2.2.5 (2)
     3: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __gmon_start__
     4: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND dlopen@GLIBC_2.2.5 (3)
     5: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND _ITM_registerTMCloneTable
     6: 0000000000000000     0 FUNC    WEAK   DEFAULT  UND __cxa_finalize@GLIBC_2.2.5 (2)

The PLT entries are there, but they are filled with zeroes, and aren't even recognized by objdump:

$ objdump -j .plt -M intel -d prog

Disassembly of section .plt:

0000000000000560 <.plt>:
 560:   ff 35 4a 0a 20 00       push   QWORD PTR [rip+0x200a4a]        # 200fb0 <_GLOBAL_OFFSET_TABLE_+0x8>
 566:   ff 25 4c 0a 20 00       jmp    QWORD PTR [rip+0x200a4c]        # 200fb8 <_GLOBAL_OFFSET_TABLE_+0x10>
 56c:   0f 1f 40 00             nop    DWORD PTR [rax+0x0]
    ...

#   ^^^
# Here, these three dots are actually hiding another 0x10+ bytes filled with 0x0
# zip_close@plt should be here instead...

0000000000000580 <dlopen@plt>:
 580:   ff 25 42 0a 20 00       jmp    QWORD PTR [rip+0x200a42]        # 200fc8 <dlopen@GLIBC_2.2.5>
 586:   68 00 00 00 00          push   0x0
 58b:   e9 d0 ff ff ff          jmp    560 <.plt>
    ...

#   ^^^
# Here, these three dots are actually hiding another 0x10+ bytes filled with 0x0
# zip_open@plt should be here instead...

When the program is run, dlopen() works fine and loads libzip into memory, but then when zip_open() gets called, it just generates a segmentation fault:

$ ./prog
Segmentation fault (core dumped)

Taking a look with a debugger, the issue is even more obvious (in case it wasn't already obvious enough). The PLT entries filled with zeroes just end up decoding to a bunch of add instructions dereferencing rax, which contains an invalid address and makes the program segfault and die:

0x5555555546e5 <main+43>               lea    rax, [rbp - 0x14]
0x5555555546e9 <main+47>               mov    rdx, rax
0x5555555546ec <main+50>               mov    esi, 9
0x5555555546f1 <main+55>               lea    rdi, [rip + 0xc6]
0x5555555546f8 <main+62>               call   dlopen@plt+16 <0x555555554590>
 |
 v ### Broken PLT entry (all 0x0, will cause a segfault):
0x555555554590 <dlopen@plt+16>         add    byte ptr [rax], al
0x555555554592 <dlopen@plt+18>         add    byte ptr [rax], al
0x555555554594 <dlopen@plt+20>         add    byte ptr [rax], al
0x555555554596 <dlopen@plt+22>         add    byte ptr [rax], al
0x555555554598 <dlopen@plt+24>         add    byte ptr [rax], al
0x55555555459a <dlopen@plt+26>         add    byte ptr [rax], al
0x55555555459c <dlopen@plt+28>         add    byte ptr [rax], al
0x55555555459e <dlopen@plt+30>         add    byte ptr [rax], al
   ### Next PLT entry...
0x5555555545a0 <__cxa_finalize@plt>    jmp    qword ptr [rip + 0x200a52] <0x7ffff7823520>
 |
 v
0x7ffff7823520 <__cxa_finalize>        push   r15
0x7ffff7823522 <__cxa_finalize+2>      push   r14

Questions

  1. So, first of all... why is this happening?
  2. I thought this was supposed to work. Is that not the case? If not, why? And why does it fail on only one of the two machines?
  3. But most importantly: how can I fix this?

For question 3 I want to emphasize that the whole point of this is that I want to load the library myself, without linking it, so please refrain from just commenting that this is bad practice, or whatever else.


Solution

  • The above Should Just Work™, and indeed it does seem to...

    No, it should not, and if it appears to, that's more of an accident. In general, using --unresolved-symbols=... is a really bad idea™, and will almost never do what you want.

    The solution is trivial: you just need to look up zip_open and zip_close with dlsym() and call them through function pointers, like so:

    #include <dlfcn.h>
    #include <zip.h>

    int main(void) {
        void *lib;
        /* Function pointers matching the libzip prototypes. */
        zip_t *(*p_open)(const char *, int, int *);
        int (*p_close)(zip_t *);
        int err;
        zip_t *myzip;

        lib = dlopen("libzip.so", RTLD_LAZY | RTLD_GLOBAL);
        if (lib == NULL)
            return 1;

        /* Resolve both functions by name in the freshly loaded library. */
        p_open = (zip_t *(*)(const char *, int, int *))dlsym(lib, "zip_open");
        if (p_open == NULL)
            return 1;
        p_close = (int (*)(zip_t *))dlsym(lib, "zip_close");
        if (p_close == NULL)
            return 1;

        myzip = p_open("myzip.zip", ZIP_CREATE | ZIP_TRUNCATE, &err);
        if (myzip == NULL)
            return 1;

        p_close(myzip);

        return 0;
    }
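For readers who want to try the dlopen()/dlsym() pattern without libzip installed, here is a hedged sketch of the same technique against libm. The helper names lookup_symbol and call_libm are hypothetical, and the soname "libm.so.6" is an assumption that holds on glibc-based Linux systems. The sketch also reports failures through dlerror() instead of only comparing the dlsym() result with NULL.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Pointer type for one-argument libm functions such as sqrt() or cos(). */
typedef double (*unary_math_fn)(double);

/* Hypothetical helper: look up `name` in `handle` and report failures. */
static void *lookup_symbol(void *handle, const char *name)
{
    assert(handle != NULL && name != NULL);

    dlerror();                        /* clear any stale error state */
    void *addr = dlsym(handle, name);
    const char *msg = dlerror();      /* non-NULL only if the lookup failed */
    if (msg != NULL) {
        fprintf(stderr, "dlsym(%s) failed: %s\n", name, msg);
        return NULL;
    }
    return addr;
}

/* Hypothetical helper: load libm at run time and call the named function
 * through a pointer. Returns 0 on success and stores the result in *out. */
static int call_libm(const char *fn_name, double arg, double *out)
{
    void *lib = dlopen("libm.so.6", RTLD_LAZY);   /* assumed glibc soname */
    if (lib == NULL) {
        const char *msg = dlerror();
        fprintf(stderr, "dlopen failed: %s\n", msg != NULL ? msg : "unknown error");
        return -1;
    }

    void *addr = lookup_symbol(lib, fn_name);
    if (addr == NULL) {
        dlclose(lib);
        return -1;
    }

    /* POSIX requires that a dlsym() result for a function can be
     * converted to a function pointer; ISO C alone does not. */
    unary_math_fn fn = (unary_math_fn)addr;
    *out = fn(arg);

    dlclose(lib);
    return 0;
}
```

Calling call_libm("sqrt", 16.0, &result) loads the library, resolves the symbol by name and invokes it through the pointer, so the executable needs no undefined references to the library and no --unresolved-symbols flag at link time.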