
Why does a fully static Rust ELF binary have a Global Offset Table (GOT) section?


This code, when compiled for the x86_64-unknown-linux-musl target, produces a .got section:

fn main() {
    println!("Hello, world!");
}

$ cargo build --release --target x86_64-unknown-linux-musl
$ readelf -S hello
There are 30 section headers, starting at offset 0x26dc08:

Section Headers:
[Nr] Name              Type             Address           Offset
   Size              EntSize          Flags  Link  Info  Align
...
[12] .got              PROGBITS         0000000000637b58  00037b58
   00000000000004a8  0000000000000008  WA       0     0     8
...

According to this answer for analogous C code, the .got section is an artifact that can be safely removed. However, it segfaults for me:

$ objcopy -R.got hello hello_no_got
$ ./hello_no_got
[1]    3131 segmentation fault (core dumped)  ./hello_no_got

Looking at the disassembly, I see that the GOT basically holds static function addresses:

$ objdump -d hello -M intel
...
0000000000400340 <_ZN5hello4main17h5d434a6e08b2e3b8E>:
...
  40037c:       ff 15 26 7a 23 00       call   QWORD PTR [rip+0x237a26]        # 637da8 <_GLOBAL_OFFSET_TABLE_+0x250>
...

$ objdump -s -j .got hello | grep 637da8
637da8 50434000 00000000 b0854000 00000000  PC@.......@.....

$ objdump -d hello -M intel | grep 404350
0000000000404350 <_ZN3std2io5stdio6_print17h522bda9f206d7fddE>:
  404350:       41 57                   push   r15

The number 404350 comes from the bytes 50434000 00000000, which are the little-endian encoding of 0x0000000000404350 (this was not obvious; I had to run the binary under GDB to figure it out!)
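
For reference, both steps above (finding the GOT slot that a rip-relative call refers to, and decoding the slot contents) can be reproduced with a few lines of Python. This is only a sketch: the helper names are made up for illustration, and the constants are copied from the objdump output above.

```python
def rip_relative_target(insn_addr: int, insn_len: int, disp32: int) -> int:
    """Address referenced by a [rip+disp32] operand.

    During execution rip already points at the next instruction,
    so the base is the instruction address plus its length.
    """
    return insn_addr + insn_len + disp32


def decode_got_entry(hex_words: str) -> int:
    """Decode the first 8 bytes of an `objdump -s` row as a little-endian u64."""
    raw = bytes.fromhex(hex_words.replace(" ", ""))
    return int.from_bytes(raw[:8], "little")


# call QWORD PTR [rip+0x237a26] at 0x40037c, 6 bytes long
slot = rip_relative_target(0x40037C, 6, 0x237A26)
# row of `objdump -s -j .got` starting at 0x637da8
target = decode_got_entry("50434000 00000000 b0854000 00000000")
print(hex(slot), hex(target))  # 0x637da8 0x404350
```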

This is perplexing, since Wikipedia says that

[GOT] is used by executed programs to find during runtime addresses of global variables, unknown in compile time. The global offset table is updated in process bootstrap by the dynamic linker.

  1. Why is the GOT present? From the disassembly, it looks like the compiler knows all the needed addresses. As far as I know, no bootstrap is done by a dynamic linker: my binary has neither an INTERP nor a DYNAMIC program header.
  2. Why does the GOT store function pointers? Wikipedia says the GOT is only for global variables, and function pointers should be contained in the PLT.

Solution

  • TL;DR: the GOT here is really a vestigial build artifact, which I was able to get rid of with some simple machine-code patching.

    Breakdown

    If we look at

    $ objdump -dj .text hello
    

    and search for GLOBAL, we see only four distinct types of references to the GOT (constants differ):

      40037c:       ff 15 26 7a 23 00       call   QWORD PTR [rip+0x237a26]        # 637da8 <_GLOBAL_OFFSET_TABLE_+0x250>
      425903:       ff 25 5f 26 21 00       jmp    QWORD PTR [rip+0x21265f]        # 637f68 <_GLOBAL_OFFSET_TABLE_+0x410>
      41d8b5:       48 3b 1d b4 a5 21 00    cmp    rbx,QWORD PTR [rip+0x21a5b4]    # 637e70 <_GLOBAL_OFFSET_TABLE_+0x318>
      40b259:       48 83 3d 7f cb 22 00    cmp    QWORD PTR [rip+0x22cb7f],0x0    # 637de0 <_GLOBAL_OFFSET_TABLE_+0x288>
      40b260:       00
    

    All of these instructions only read from the GOT, which means that the GOT is not modified at runtime. This in turn means that we can resolve the addresses stored in the GOT statically! Let's consider the reference types one by one:

    1. call QWORD PTR [rip+0x237a26] simply says "go to address rip+0x237a26, take 8 bytes from there, interpret them as a function address and call the function". We can simply replace this instruction with a direct call:
      40037c:       e8 cf 3f 00 00          call   404350 <_ZN3std2io5stdio6_print17h522bda9f206d7fddE>
      400381:       90                      nop
    

    Notice the nop at the end: we need to overwrite all 6 bytes of machine code that make up the original instruction, but the replacement instruction is only 5 bytes long, so we need to pad it. Fundamentally, since we are patching a compiled binary, we can replace an instruction with another one only if the new one is not longer.

    2. jmp QWORD PTR [rip+0x21265f] is the same as the previous one, but instead of calling an address it jumps to it. This turns into:
      425903:       e9 b8 f7 ff ff          jmp    4250c0 <_ZN68_$LT$core..fmt..builders..PadAdapter$u20$as$u20$core..fmt..Write$GT$9write_str17hc384e51187942069E>
      425908:       90                      nop
    
    3. cmp rbx,QWORD PTR [rip+0x21a5b4] takes 8 bytes from address rip+0x21a5b4 and compares them to the contents of the rbx register. This one is tricky, since cmp cannot compare a register to a 64-bit immediate value. We could use another register for that, but we don't know which registers are in use around this instruction. A careful solution would be something like
    push rax
    mov rax,0x0000006363c0
    cmp rbx,rax
    pop rax
    

    But that would be way beyond our limit of 7 bytes. The real solution stems from the observation that the GOT contains only addresses, and that our address space is (roughly) contained in the range [0x400000; 0x650000], as can be seen in the program headers:

    $ readelf -l hello
    ...
    Program Headers:
      Type           Offset             VirtAddr           PhysAddr
                     FileSiz            MemSiz              Flags  Align
      LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                     0x0000000000035b50 0x0000000000035b50  R E    0x200000
      LOAD           0x0000000000036380 0x0000000000636380 0x0000000000636380
                     0x0000000000001dd0 0x0000000000003918  RW     0x200000
    ...
    

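    It follows that we can (mostly) get away with comparing only the lower 4 bytes of a GOT entry instead of all 8. The result is wrong only if rbx holds a value whose lower 32 bits match the address while its upper 32 bits are non-zero. So the substitution is: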

      41d8b5:       81 fb c0 63 63 00       cmp    ebx,0x6363c0
      41d8bb:       90                      nop
    
    4. The last one consists of two lines of objdump output, since 8 bytes do not fit in one line:
      40b259:       48 83 3d 7f cb 22 00    cmp    QWORD PTR [rip+0x22cb7f],0x0    # 637de0 <_GLOBAL_OFFSET_TABLE_+0x288>
      40b260:       00
    

    It just compares 8 bytes of the GOT to a constant (in this case, 0x0). In fact, we can do the comparison statically; if the operands compare equal, we replace the comparison with

      40b259:       48 39 c0                cmp    rax,rax
      40b25c:       90                      nop
      40b25d:       90                      nop
      40b25e:       90                      nop
      40b25f:       90                      nop
      40b260:       90                      nop
    

    Obviously, a register is always equal to itself. A lot of padding needed here!

    If the left operand is greater than the right one, we replace the comparison with

      40b259:       48 83 fc 00             cmp    rsp,0x0 
      40b25d:       90                      nop
      40b25e:       90                      nop
      40b25f:       90                      nop
      40b260:       90                      nop
    

    In practice, rsp is always greater than zero.

    If the left operand is smaller than the right one, things get a bit more complicated, but since we have a whole lot of bytes (8!) we can manage:

      40b259:  50                      push   rax
      40b25a:  31 c0                   xor    eax,eax
      40b25c:  83 f8 01                cmp    eax,0x1
      40b25f:  58                      pop    rax
      40b260:  90                      nop
    

    Notice that the second and the third instructions use eax instead of rax, since cmp and xor involving eax take one less byte than with rax.
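
The encodings of the direct call, the direct jmp and the 32-bit cmp shown above can be double-checked with a short Python sketch. The helper names `direct_branch` and `cmp_ebx_imm32` are hypothetical (the script below uses its own helpers), and the addresses are taken from the disassembly above.

```python
import struct

NOP = b"\x90"


def direct_branch(opcode: int, insn_addr: int, target: int, old_len: int = 6) -> bytes:
    """Encode `call rel32` (0xe8) or `jmp rel32` (0xe9), NOP-padded to old_len bytes."""
    # rel32 is measured from the end of the 5-byte replacement instruction
    rel = target - (insn_addr + 5)
    if not -2**31 <= rel < 2**31:
        raise ValueError("target is out of rel32 range")
    patch = bytes([opcode]) + struct.pack("<i", rel)
    return patch + NOP * (old_len - len(patch))


def cmp_ebx_imm32(value: int, old_len: int = 7) -> bytes:
    """Encode `cmp ebx, imm32` (81 fb id), NOP-padded to old_len bytes."""
    if not 0 <= value < 2**32:
        raise ValueError("value does not fit in 32 bits")
    patch = b"\x81\xfb" + struct.pack("<I", value)
    return patch + NOP * (old_len - len(patch))


call_patch = direct_branch(0xE8, 0x40037C, 0x404350)
jmp_patch = direct_branch(0xE9, 0x425903, 0x4250C0)
cmp_patch = cmp_ebx_imm32(0x6363C0)
print(call_patch.hex(" "))  # e8 cf 3f 00 00 90
print(jmp_patch.hex(" "))   # e9 b8 f7 ff ff 90
print(cmp_patch.hex(" "))   # 81 fb c0 63 63 00 90
```

The backward jmp shows why the displacement has to be packed as a signed 32-bit value: the target lies below the instruction, so rel32 is negative.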

    Testing

    I have written a Python script to do all these substitutions automatically (it's a bit hacky and relies on parsing of objdump output though):

    #!/usr/bin/env python3
    
    import re
    import sys
    import argparse
    import subprocess
    
    def read_u64(binary):
        return sum(binary[i] * 256 ** i for i in range(8))
    
    def distance_u32(start, end):
        assert abs(end - start) < 2 ** 31
        diff = end - start
        if diff < 0:
            return 2 ** 32 + diff
        else:
            return diff
    
    def to_u32(x):
        assert 0 <= x < 2 ** 32
        return bytes((x // (256 ** i)) % 256 for i in range(4))
    
    class GotInstruction:
        def __init__(self, lines, symbol_address, symbol_offset):
            self.address = int(lines[0].split(":")[0].strip(), 16)
            self.offset = symbol_offset + (self.address - symbol_address)
            self.got_offset = int(lines[0].split("(File Offset: ")[1].strip().strip(")"), 16)
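            # objdump presumably computes this offset as if the address were inside .text.
            # The segments are 0x200000-aligned, so for this binary the modulo recovers
            # the real file offset.
            self.got_offset = self.got_offset % 0x200000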
            self.bytes = []
            for line in lines:
                self.bytes += [int(x, 16) for x in line.split("\t")[1].split()]
    
    class TextDump:
        symbol_regex = re.compile(r"^([0-9a-f]{16}) <(.*)> \(File Offset: 0x([0-9a-f]*)\):")
    
        def __init__(self, binary_path):
            self.got_instructions = []
            objdump_output = subprocess.check_output(["objdump", "-Fdj", ".text", "-M", "intel",
                                                      binary_path])
            lines = objdump_output.decode("utf-8").split("\n")
            current_symbol_address = 0
            current_symbol_offset = 0
            for line_group in self.group_lines(lines):
                match = self.symbol_regex.match(line_group[0])
                if match is not None:
                    current_symbol_address = int(match.group(1), 16)
                    current_symbol_offset = int(match.group(3), 16)
                elif "_GLOBAL_OFFSET_TABLE_" in line_group[0]:
                    instruction = GotInstruction(line_group, current_symbol_address,
                                                 current_symbol_offset)
                    self.got_instructions.append(instruction)
    
        @staticmethod
        def group_lines(lines):
            if not lines:
                return
            line_group = [lines[0]]
            for line in lines[1:]:
                if line.count("\t") == 1:  # this line continues the previous one
                    line_group.append(line)
                else:
                    yield line_group
                    line_group = [line]
            yield line_group
    
        def __iter__(self):
            return iter(self.got_instructions)
    
    def read_binary_file(path):
        try:
            with open(path, "rb") as f:
                return f.read()
        except (IOError, OSError) as exc:
            print(f"Failed to open {path}: {exc.strerror}")
            sys.exit(1)
    
    def write_binary_file(path, content):
        try:
            with open(path, "wb") as f:
                f.write(content)
        except (IOError, OSError) as exc:
            print(f"Failed to write {path}: {exc.strerror}")
            sys.exit(1)
    
    def patch_got_reference(instruction, binary_content):
        got_data = read_u64(binary_content[instruction.got_offset:])
        code = instruction.bytes
        if code[0] == 0xff:
            assert len(code) == 6
            relative_address = distance_u32(instruction.address, got_data)
            if code[1] == 0x15:  # call QWORD PTR [rip+...]
                patch = b"\xe8" + to_u32(relative_address - 5) + b"\x90"
            elif code[1] == 0x25:  # jmp QWORD PTR [rip+...]
                patch = b"\xe9" + to_u32(relative_address - 5) + b"\x90"
            else:
                raise ValueError(f"unknown machine code: {code}")
        elif code[:3] == [0x48, 0x83, 0x3d]:  # cmp QWORD PTR [rip+...],<BYTE>
            assert len(code) == 8
            # Every patch below must be exactly 8 bytes to cover the whole instruction
            if got_data == code[7]:
                patch = b"\x48\x39\xc0" + b"\x90" * 5  # cmp rax,rax
            elif got_data > code[7]:
                patch = b"\x48\x83\xfc\x00" + b"\x90" * 4  # cmp rsp,0x0
            else:
                patch = b"\x50\x31\xc0\x83\xf8\x01\x58\x90"  # push rax
                                                             # xor eax,eax
                                                             # cmp eax,0x1
                                                             # pop rax
                                                             # nop
        elif code[:3] == [0x48, 0x3b, 0x1d]:  # cmp rbx,QWORD PTR [rip+...]
            assert len(code) == 7
            patch = b"\x81\xfb" + to_u32(got_data) + b"\x90"  # cmp ebx,<DWORD>
        else:
            raise ValueError(f"unknown machine code: {code}")
        return dict(offset=instruction.offset, data=patch)
    
    def make_got_patches(binary_path, binary_content):
        patches = []
        text_dump = TextDump(binary_path)
        for instruction in text_dump.got_instructions:
            patches.append(patch_got_reference(instruction, binary_content))
        return patches
    
    def apply_patches(binary_content, patches):
        for patch in patches:
            offset = patch["offset"]
            data = patch["data"]
            binary_content = binary_content[:offset] + data + binary_content[offset + len(data):]
        return binary_content
    
    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("binary_path", help="Path to ELF binary")
        parser.add_argument("-o", "--output", help="Output file path", required=True)
        args = parser.parse_args()
    
        binary_content = read_binary_file(args.binary_path)
        patches = make_got_patches(args.binary_path, binary_content)
        patched_content = apply_patches(binary_content, patches)
        write_binary_file(args.output, patched_content)
    
    if __name__ == "__main__":
        main()
    

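The script takes GOT file offsets from objdump's output and corrects them with `% 0x200000`. An alternative that may be more robust is to translate virtual addresses to file offsets through the LOAD segments printed by readelf -l. The following is a minimal sketch, assuming the two segments shown earlier are hard-coded; `LOAD_SEGMENTS` and `vaddr_to_file_offset` are illustrative names, not part of the script above.

```python
# (file offset, virtual address, size in file) of each LOAD segment,
# copied from the `readelf -l hello` output shown earlier
LOAD_SEGMENTS = [
    (0x000000, 0x400000, 0x35B50),  # R E: code and read-only data
    (0x036380, 0x636380, 0x01DD0),  # RW: writable data, including .got
]


def vaddr_to_file_offset(vaddr: int, segments=LOAD_SEGMENTS) -> int:
    """Map a virtual address to its offset in the ELF file."""
    for file_offset, start, file_size in segments:
        if start <= vaddr < start + file_size:
            return file_offset + (vaddr - start)
    raise ValueError(f"address {vaddr:#x} is not backed by file contents")


# GOT slot referenced by the call at 0x40037c
print(hex(vaddr_to_file_offset(0x637DA8)))  # 0x37da8
```

The result agrees with the section header from the question: .got starts at file offset 0x37b58, and the slot is 0x250 bytes into it.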
    Now we can get rid of the GOT for real:

    $ cargo build --release --target x86_64-unknown-linux-musl
    $ ./resolve_got.py target/x86_64-unknown-linux-musl/release/hello -o hello_no_got
    $ objcopy -R.got hello_no_got
    $ readelf -e hello_no_got | grep .got
    $ ./hello_no_got
    Hello, world!
    

    I have also tested it on my ~3k LOC app, and it seems to work alright.

    P.S. I am not an expert in assembly, so some of the above might be inaccurate.