
Why are imported functions called so indirectly in Linux?


Consider a simple C program:

#include <stdio.h>

int main()
{
    puts("Hello");
    return 0;
}

Running it with GDB, having set LD_BIND_NOW=1 for simplicity, I can observe the following:

$ gdb -q ./test -ex 'b main' -ex r
Reading symbols from ./test...done.
Breakpoint 1 at 0x8048420
Starting program: /tmp/test 

Breakpoint 1, 0x08048420 in main ()
(gdb) disas
Dump of assembler code for function main:
   0x0804841d <+0>:     push   ebp
   0x0804841e <+1>:     mov    ebp,esp
=> 0x08048420 <+3>:     and    esp,0xfffffff0
   0x08048423 <+6>:     sub    esp,0x10
   0x08048426 <+9>:     mov    DWORD PTR [esp],0x8048500
   0x0804842d <+16>:    call   0x80482c0 <puts@plt>
   0x08048432 <+21>:    mov    eax,0x0
   0x08048437 <+26>:    leave  
   0x08048438 <+27>:    ret    
End of assembler dump.
(gdb) si 4
0x080482c0 in puts@plt ()
(gdb) disas
Dump of assembler code for function puts@plt:
=> 0x080482c0 <+0>:     jmp    DWORD PTR ds:0x8049670
   0x080482c6 <+6>:     push   0x0
   0x080482cb <+11>:    jmp    0x80482b0
End of assembler dump.
(gdb) si
_IO_puts (str=0x8048500 "Hello") at ioputs.c:35
35      {
(gdb)

Apparently, after binding the PLT entry to the function, we still do a two-step call:

  1. call puts@plt
  2. jmp [ds:puts_address]

Compare this with Win32, where every call to an imported function, e.g. MessageBoxA, looks like

call [ds:MessageBoxA_address]

i.e. in a single step.

Even taking lazy binding into account, [puts_address] could initially hold the address of _dl_runtime_resolve (or whatever is needed at startup), so the one-step indirect call would still work.

So what's the reason for such a complication? Is this some sort of branch prediction or branch target prediction optimization?

EDIT in response to Employed Russian's answer (v2)

What I actually mean is that this indirection of call PLT; jump [GOT] is redundant even in the context of lazy binding. Consider the following example (relies on compilation without optimizations by gcc):

#include <stdio.h>

int main()
{
    for(int i=0;i<3;++i)
    {
        puts("Hello");
        __asm__ __volatile__("nop");
    }
    return 0;
}

Running it (with LD_BIND_NOW unset) in GDB:

$ gdb ./test -ex 'b main' -ex r -ex disas/r
Reading symbols from ./test...done.
Breakpoint 1 at 0x8048387
Starting program: /tmp/test 

Breakpoint 1, 0x08048387 in main ()
Dump of assembler code for function main:
   ...
   0x08048397 <+19>:    c7 04 24 80 84 04 08    mov    DWORD PTR [esp],0x8048480
   0x0804839e <+26>:    e8 11 ff ff ff  call   0x80482b4 <puts@plt>
   0x080483a3 <+31>:    90      nop
   0x080483a4 <+32>:    83 44 24 1c 01  add    DWORD PTR [esp+0x1c],0x1
   ...

Disassembling puts@plt, we can see the address of GOT entry for puts:

(gdb) disas 'puts@plt'
Dump of assembler code for function puts@plt:
   0x080482b4 <+0>:     jmp    DWORD PTR ds:0x8049580
   0x080482ba <+6>:     push   0x10
   0x080482bf <+11>:    jmp    0x8048284
End of assembler dump.

So the GOT entry is at 0x8049580. We can patch the code of main() at address 0x804839e, replacing e8 11 ff ff ff 90 (the call and the nop that follows it) with an indirect call through the GOT entry, i.e. call [ds:0x8049580], encoded as ff 15 80 95 04 08:

(gdb) set *(uint64_t*)0x804839e=0x44830804958015ff
(gdb) disas/r
Dump of assembler code for function main:
   ...
   0x08048397 <+19>:    c7 04 24 80 84 04 08    mov    DWORD PTR [esp],0x8048480
   0x0804839e <+26>:    ff 15 80 95 04 08       call   DWORD PTR ds:0x8049580
   0x080483a4 <+32>:    83 44 24 1c 01  add    DWORD PTR [esp+0x1c],0x1
   ...

Running the program after this still gives:

(gdb) c
Continuing.
Hello
Hello
Hello
[Inferior 1 (process 14678) exited normally]

I.e. the first call did the lazy binding, and the next two just used the result of the fixup (you can trace it yourself if you don't believe it).

So the question remains: why is this way of calling not used by GCC?


Solution

  • "Apparently, after binding the PLT entry to the function, we still do a two-step call:"

    call puts@plt
    jmp [ds:puts_address]
    

    The compiler and linker can't know that you are going to set LD_BIND_NOW=1 at runtime, so they can't go back in time and rewrite the generated code to call through [puts_address] directly.

    See also recent -fno-plt patches on the gcc-patches mailing list.

    Win32

    Win32 doesn't allow lazy function resolution (at least not by default). In other words, they compile / link code that only works as if LD_BIND_NOW=1 is hard-coded at compile / link time. Some history here.

    "it's still possible to have e.g. [puts_address] contain the call to _dl_runtime_resolve or whatever is needed on startup, so the one-step indirect call would still work."

    I think you are confused. [puts_address] does contain _dl_runtime_resolve at startup (well, not exactly: it initially points back at the push instruction inside puts@plt, which then leads to the resolver. Gory details). Your question is really "why can't the call go directly through [puts_address], why is puts@plt needed?".

    The answer is that _dl_runtime_resolve needs to know which function it is resolving. It can't deduce that info from arguments to puts. The entire raison d'être of puts@plt is exactly to supply that info to _dl_runtime_resolve.

    Update:

    Why can't call <puts@plt> be replaced with call *[puts@GOT]?

    The answer is provided in the first -fno-plt patch I referenced:

    "This comes with caveats. This cannot be generally done for all functions marked extern as it is impossible for the compiler to say if a function is "truly extern" (defined in a shared library). If a function is not truly extern(ends up defined in the final executable), then calling it indirectly is a performance penalty as it could have been a direct call."

    You could then ask: why can't the linker (which knows whether puts is defined in the same binary or in a separate DSO) rewrite the call *[puts@GOT] back into call <puts@plt>?

    The answer is that these are different instructions (different op-codes), and linkers generally do not change instructions, only addresses within instructions (in response to relocation entries).

    In theory the linker could do this, but no-one's bothered yet.