Why does a fully static Rust ELF binary have a Global Offset Table (GOT) section?

This code, when compiled for the x86_64-unknown-linux-musl target, produces a .got section:

fn main() {
    println!("Hello, world!");
}

$ cargo build --release --target x86_64-unknown-linux-musl
$ readelf -S hello
There are 30 section headers, starting at offset 0x26dc08:

Section Headers:
[Nr] Name              Type             Address           Offset
   Size              EntSize          Flags  Link  Info  Align
...
[12] .got              PROGBITS         0000000000637b58  00037b58
   00000000000004a8  0000000000000008  WA       0     0     8
...

According to this answer for analogous C code, the .got section is an artifact that can be safely removed. However, it segfaults for me:

$ objcopy -R.got hello hello_no_got
$ ./hello_no_got
[1]    3131 segmentation fault (core dumped)  ./hello_no_got

Looking at the disassembly, I see that the GOT basically holds static function addresses:

$ objdump -d hello -M intel
...
0000000000400340 <_ZN5hello4main17h5d434a6e08b2e3b8E>:
...
  40037c:       ff 15 26 7a 23 00       call   QWORD PTR [rip+0x237a26]        # 637da8 <_GLOBAL_OFFSET_TABLE_+0x250>
...

$ objdump -s -j .got hello | grep 637da8
637da8 50434000 00000000 b0854000 00000000  PC@.......@.....

$ objdump -d hello -M intel | grep 404350
0000000000404350 <_ZN3std2io5stdio6_print17h522bda9f206d7fddE>:
  404350:       41 57                   push   r15

The number 404350 comes from 50434000 00000000, which is a little-endian 0x00000000000404350 (this was not obvious; I had to run the binary under GDB to figure this out!)

This is perplexing, since Wikipedia says that

[GOT] is used by executed programs to find during runtime addresses of global variables, unknown in compile time. The global offset table is updated in process bootstrap by the dynamic linker.

Why is the GOT present? From the disassembly, it looks like the compiler knows all the needed addresses. As far as I know, there is no bootstrap done by the dynamic linker: there is neither INTERP nor DYNAMIC program headers present in my binary;
Why does the GOT store function pointers? Wikipedia says the GOT is only for global variables, and function pointers should be contained in the PLT.

Solution

TL;DR summary: the GOT is really a rudimentary build artifact, which I was able to get rid of via simple machine code manipulations.

Breakdown

If we look at

$ objdump -dj .text hello

and search for GLOBAL, we see only four distinct types of references to the GOT (constants differ):

  40037c:       ff 15 26 7a 23 00       call   QWORD PTR [rip+0x237a26]        # 637da8 <_GLOBAL_OFFSET_TABLE_+0x250>
  425903:       ff 25 5f 26 21 00       jmp    QWORD PTR [rip+0x21265f]        # 637f68 <_GLOBAL_OFFSET_TABLE_+0x410>
  41d8b5:       48 3b 1d b4 a5 21 00    cmp    rbx,QWORD PTR [rip+0x21a5b4]    # 637e70 <_GLOBAL_OFFSET_TABLE_+0x318>
  40b259:       48 83 3d 7f cb 22 00    cmp    QWORD PTR [rip+0x22cb7f],0x0    # 637de0 <_GLOBAL_OFFSET_TABLE_+0x288>
  40b260:       00

All of these are reading instructions, which means that the GOT is not modified at runtime. This in turn means that we can statically resolve the addresses that the GOT refers to! Let's consider the reference types one by one:

call QWORD PTR [rip+0x2126be] simply says "go to address [rip+0x2126be], take 8 bytes from there, interpret them as a function address and call the function". We can simply replace this instruction with a direct call:

  40037c:       e8 cf 3f 00 00          call   404350 <_ZN3std2io5stdio6_print17h522bda9f206d7fddE>
  400381:       90                      nop

Notice the nop at the end: we need to replace all the 6 bytes of the machine code that constitute the first instruction, but the instruction we replace it with is only 5 bytes, so we need to pad it. Fundamentally, as we are patching a compiled binary, we can replace an instruction with a another one only if it is not longer.

jmp QWORD PTR [rip+0x21265f] is the same as the previous one, but instead of calling an address it jumps to it. This turns into:

  425903:       e9 b8 f7 ff ff          jmp    4250c0 <_ZN68_$LT$core..fmt..builders..PadAdapter$u20$as$u20$core..fmt..Write$GT$9write_str17hc384e51187942069E>
  425908:       90                      nop

cmp rbx,QWORD PTR [rip+0x21a5b4] - this takes 8 bytes from [rip+0x21a5b4] and compares them to the contents of rbx register. This one is tricky, since cmp can not compare register contents to an 64-bit immediate value. We could use another register for that, but we don't know which of the registers are used around this instruction. A careful solution would be something like

push rax
mov rax,0x0000006363c0
cmp rbx,rax
pop rax

But that would be way beyond our limit of 7 bytes. The real solution stems from an observation that the GOT contains only addresses; our address space is (roughly) contained in range [0x400000; 0x650000], which can be seen in the program headers:

$ readelf -l hello
...
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000035b50 0x0000000000035b50  R E    0x200000
  LOAD           0x0000000000036380 0x0000000000636380 0x0000000000636380
                 0x0000000000001dd0 0x0000000000003918  RW     0x200000
...

It follows that we can (mostly) get away with only comparing 4 bytes of a GOT entry instead of 8. So the substitution is:

  41d8b5:       81 fb c0 63 63 00       cmp    ebx,0x6363c0
  41d8bb:       90                      nop

The last one consists of two lines of objdump output, since 8 bytes do not fit in one line:

  40b259:       48 83 3d 7f cb 22 00    cmp    QWORD PTR [rip+0x22cb7f],0x0    # 637de0 <_GLOBAL_OFFSET_TABLE_+0x288>
  40b260:       00

It just compares 8 bytes of the GOT to a constant (in this case, 0x0). In fact, we can do the comparison statically; if the operands compare equal, we replace the comparison with

  40b259:       48 39 c0                cmp    rax,rax
  40b25c:       90                      nop
  40b25d:       90                      nop
  40b25e:       90                      nop
  40b25f:       90                      nop
  40b260:       90                      nop

Obviously, a register is always equal to itself. A lot of padding needed here!

If the left operand is greater than the right one, we replace the comparison with

  40b259:       48 83 fc 00             cmp    rsp,0x0 
  40b25d:       90                      nop
  40b25e:       90                      nop
  40b25f:       90                      nop
  40b260:       90                      nop

In practice, rsp is always greater than zero.

If the left operand is smaller than the right one, things get a bit more complicated, but since we have a whole lot of bytes (8!) we can manage:

  40b259:  50                      push   rax
  40b25a:  31 c0                   xor    eax,eax
  40b25c:  83 f8 01                cmp    eax,0x1
  40b25f:  58                      pop    rax
  40b260:  90                      nop

Notice that the second and the third instructions use eax instead of rax, since cmp and xor involving eax take one less byte than with rax.

Testing

I have written a Python script to do all these substitutions automatically (it's a bit hacky and relies on parsing of objdump output though):

#!/usr/bin/env python3

import re
import sys
import argparse
import subprocess

def read_u64(binary):
    return sum(binary[i] * 256 ** i for i in range(8))

def distance_u32(start, end):
    assert abs(end - start) < 2 ** 31
    diff = end - start
    if diff < 0:
        return 2 ** 32 + diff
    else:
        return diff

def to_u32(x):
    assert 0 <= x < 2 ** 32
    return bytes((x // (256 ** i)) % 256 for i in range(4))

class GotInstruction:
    def __init__(self, lines, symbol_address, symbol_offset):
        self.address = int(lines[0].split(":")[0].strip(), 16)
        self.offset = symbol_offset + (self.address - symbol_address)
        self.got_offset = int(lines[0].split("(File Offset: ")[1].strip().strip(")"), 16)
        self.got_offset = self.got_offset % 0x200000  # No idea why the offset is actually wrong
        self.bytes = []
        for line in lines:
            self.bytes += [int(x, 16) for x in line.split("\t")[1].split()]

class TextDump:
    symbol_regex = re.compile(r"^([0-9,a-f]{16}) <(.*)> \(File Offset: 0x([0-9,a-f]*)\):")

    def __init__(self, binary_path):
        self.got_instructions = []
        objdump_output = subprocess.check_output(["objdump", "-Fdj", ".text", "-M", "intel",
                                                  binary_path])
        lines = objdump_output.decode("utf-8").split("\n")
        current_symbol_address = 0
        current_symbol_offset = 0
        for line_group in self.group_lines(lines):
            match = self.symbol_regex.match(line_group[0])
            if match is not None:
                current_symbol_address = int(match.group(1), 16)
                current_symbol_offset = int(match.group(3), 16)
            elif "_GLOBAL_OFFSET_TABLE_" in line_group[0]:
                instruction = GotInstruction(line_group, current_symbol_address,
                                             current_symbol_offset)
                self.got_instructions.append(instruction)

    @staticmethod
    def group_lines(lines):
        if not lines:
            return
        line_group = [lines[0]]
        for line in lines[1:]:
            if line.count("\t") == 1:  # this line continues the previous one
                line_group.append(line)
            else:
                yield line_group
                line_group = [line]
        yield line_group

    def __iter__(self):
        return iter(self.got_instructions)

def read_binary_file(path):
    try:
        with open(path, "rb") as f:
            return f.read()
    except (IOError, OSError) as exc:
        print(f"Failed to open {path}: {exc.strerror}")
        sys.exit(1)

def write_binary_file(path, content):
    try:
        with open(path, "wb") as f:
            f.write(content)
    except (IOError, OSError) as exc:
        print(f"Failed to open {path}: {exc.strerror}")
        sys.exit(1)

def patch_got_reference(instruction, binary_content):
    got_data = read_u64(binary_content[instruction.got_offset:])
    code = instruction.bytes
    if code[0] == 0xff:
        assert len(code) == 6
        relative_address = distance_u32(instruction.address, got_data)
        if code[1] == 0x15:  # call QWORD PTR [rip+...]
            patch = b"\xe8" + to_u32(relative_address - 5) + b"\x90"
        elif code[1] == 0x25:  # jmp QWORD PTR [rip+...]
            patch = b"\xe9" + to_u32(relative_address - 5) + b"\x90"
        else:
            raise ValueError(f"unknown machine code: {code}")
    elif code[:3] == [0x48, 0x83, 0x3d]:  # cmp QWORD PTR [rip+...],<BYTE>
        assert len(code) == 8
        if got_data == code[7]:
            patch = b"\x48\x39\xc0" + b"\x90" * 5  # cmp rax,rax
        elif got_data > code[7]:
            patch = b"\x48\x83\xfc\x00" + b"\x90" * 3  # cmp rsp,0x0
        else:
            patch = b"\x50\x31\xc0\x83\xf8\x01\x90"  # push rax
                                                     # xor eax,eax
                                                     # cmp eax,0x1
                                                     # pop rax
    elif code[:3] == [0x48, 0x3b, 0x1d]:  # cmp rbx,QWORD PTR [rip+...]
        assert len(code) == 7
        patch = b"\x81\xfb" + to_u32(got_data) + b"\x90"  # cmp ebx,<DWORD>
    else:
        raise ValueError(f"unknown machine code: {code}")
    return dict(offset=instruction.offset, data=patch)

def make_got_patches(binary_path, binary_content):
    patches = []
    text_dump = TextDump(binary_path)
    for instruction in text_dump.got_instructions:
        patches.append(patch_got_reference(instruction, binary_content))
    return patches

def apply_patches(binary_content, patches):
    for patch in patches:
        offset = patch["offset"]
        data = patch["data"]
        binary_content = binary_content[:offset] + data + binary_content[offset + len(data):]
    return binary_content

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("binary_path", help="Path to ELF binary")
    parser.add_argument("-o", "--output", help="Output file path", required=True)
    args = parser.parse_args()

    binary_content = read_binary_file(args.binary_path)
    patches = make_got_patches(args.binary_path, binary_content)
    patched_content = apply_patches(binary_content, patches)
    write_binary_file(args.output, patched_content)

if __name__ == "__main__":
    main()

Now we can get rid of the GOT for real:

$ cargo build --release --target x86_64-unknown-linux-musl
$ ./resolve_got.py target/x86_64-unknown-linux-musl/release/hello -o hello_no_got
$ objcopy -R.got hello_no_got
$ readelf -e hello_no_got | grep .got
$ ./hello_no_got
Hello, world!

I have also tested it on my ~3k LOC app, and it seems to work alright.

P.S. I am not an expert in assembly, so some of the above might be inaccurate.