linker compiler-construction executable elf

Does the final executable use symbol tables to check variable scope?


I'm trying to understand the linking and loading phases in depth.

When a translation unit is compiled / assembled into a single object file, I understand that it creates a symbol table of every variable / function found.

If a variable has only file scope, for example because it is declared with the static keyword, it will be marked as local in the symbol table.

However, when the linker produces the final executable file, is there a final symbol table there with every single entry encountered for all files?

I was confused because if we have a variable declared as static meaning only file scope within one file, when this variable is encountered every time in the executable, does the compiler have to reference the final symbol table to see its actual scope, or does it generate special code for it?

Thanks ahead.


Solution

  • When a translation unit is compiled / assembled into a single object file, I understand that it creates a symbol table of every variable / function found.

    That is mostly accurate: local (aka stack, aka automatic storage duration) variables are never put into the symbol table (except when using ancient debugging formats, such as STABS).

    You don't need to take my word for it: this is trivial to observe:

    $ cat foo.c
    int a_common_global;
    int a_global = 42;
    static int a_static = 43;
    
    static int static_fn()
    {
      return 44;
    }
    
    int global_fn()
    {
      int a_local = static_fn();
      static int a_function_static = 1;
      return a_local + a_static + a_function_static;
    }
    
    $ gcc -c foo.c
    $ readelf -Ws foo.o
    
    Symbol table '.symtab' contains 14 entries:
       Num:    Value          Size Type    Bind   Vis      Ndx Name
         0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
         1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS foo.c
         2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1
         3: 0000000000000000     0 SECTION LOCAL  DEFAULT    3
         4: 0000000000000000     0 SECTION LOCAL  DEFAULT    4
         5: 0000000000000004     4 OBJECT  LOCAL  DEFAULT    3 a_static
         6: 0000000000000000    11 FUNC    LOCAL  DEFAULT    1 static_fn
         7: 0000000000000008     4 OBJECT  LOCAL  DEFAULT    3 a_function_static.1800
         8: 0000000000000000     0 SECTION LOCAL  DEFAULT    6
         9: 0000000000000000     0 SECTION LOCAL  DEFAULT    7
        10: 0000000000000000     0 SECTION LOCAL  DEFAULT    5
        11: 0000000000000004     4 OBJECT  GLOBAL DEFAULT  COM a_common_global
        12: 0000000000000000     4 OBJECT  GLOBAL DEFAULT    3 a_global
        13: 000000000000000b    34 FUNC    GLOBAL DEFAULT    1 global_fn
    

    There are a few things worth noting here:

    1. a_local does not appear in the symbol table
    2. a_function_static got a "random" number appended to its name, so that a static variable named a_function_static in a different function will not collide with it.
    3. a_static and static_fn have LOCAL binding (internal linkage, in C terms)

    Note also that while a_static and static_fn appear in the symbol table, this is done only to assist debugging. The local symbols are not used by the subsequent link, and can be safely removed.

    After running strip --strip-unneeded foo.o:

    $ readelf -Ws foo.o
    
    Symbol table '.symtab' contains 10 entries:
       Num:    Value          Size Type    Bind   Vis      Ndx Name
         0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
         1: 0000000000000000     0 SECTION LOCAL  DEFAULT    1
         2: 0000000000000000     0 SECTION LOCAL  DEFAULT    3
         3: 0000000000000000     0 SECTION LOCAL  DEFAULT    4
         4: 0000000000000000     0 SECTION LOCAL  DEFAULT    5
         5: 0000000000000000     0 SECTION LOCAL  DEFAULT    6
         6: 0000000000000000     0 SECTION LOCAL  DEFAULT    7
         7: 0000000000000004     4 OBJECT  GLOBAL DEFAULT  COM a_common_global
         8: 0000000000000000     4 OBJECT  GLOBAL DEFAULT    3 a_global
         9: 000000000000000b    34 FUNC    GLOBAL DEFAULT    1 global_fn
    

  • However, when the linker produces the final executable file, is there a final symbol table there with every single entry encountered for all files?

    Yes. Adding main.c like so:

    $ cat main.c
    extern int global_fn();
    
    extern int a_global;
    int a_common_global = 23;
    int main()
    {
      return global_fn() + a_common_global + a_global;
    }
    
    $ gcc -c main.c foo.c
    $ gcc main.o foo.o
    $ readelf -Ws a.out
    
    Symbol table '.symtab' contains 69 entries:
       Num:    Value          Size Type    Bind   Vis      Ndx Name
    

    ... I omit uninteresting entries (there are many).

     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
    
    34: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS main.c
    35: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS foo.c
    36: 0000000000201030     4 OBJECT  LOCAL  DEFAULT   23 a_static
    37: 000000000000061c    11 FUNC    LOCAL  DEFAULT   13 static_fn
    38: 0000000000201034     4 OBJECT  LOCAL  DEFAULT   23 a_function_static.1800
    
    50: 0000000000000627    34 FUNC    GLOBAL DEFAULT   13 global_fn
    
    63: 00000000000005fa    34 FUNC    GLOBAL DEFAULT   13 main
    64: 000000000020102c     4 OBJECT  GLOBAL DEFAULT   23 a_global
    

  • I was confused because if we have a variable declared as static meaning only file scope within one file, when this variable is encountered every time in the executable, does the compiler have to reference the final symbol table to see its actual scope, or does it generate special code for it?

    At the link stage, the compiler is (usually) not invoked at all. And the linker doesn't pay (and doesn't need to pay) any attention to LOCAL symbols.

    In general, the linker only does two things:

    1. Resolve undefined references (such as the references to global_fn and a_global from main.o) to their definitions (here in foo.o) and
    2. Apply relocations.

    Applying relocations for a_static and a_function_static in foo.o doesn't actually need their names, only their offsets within the .data section, as this output should make clear:

    $ objdump -dr foo.o
    foo.o:     file format elf64-x86-64   
    Disassembly of section .text:
    ...
    000000000000000b <global_fn>:
       b:   55                      push   %rbp
       c:   48 89 e5                mov    %rsp,%rbp
       f:   48 83 ec 10             sub    $0x10,%rsp
      13:   b8 00 00 00 00          mov    $0x0,%eax
      18:   e8 e3 ff ff ff          callq  0 <static_fn>
      1d:   89 45 fc                mov    %eax,-0x4(%rbp)
      20:   8b 15 00 00 00 00       mov    0x0(%rip),%edx        # 26 <global_fn+0x1b>
                22: R_X86_64_PC32   .data
      26:   8b 45 fc                mov    -0x4(%rbp),%eax
      29:   01 c2                   add    %eax,%edx
      2b:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 31 <global_fn+0x26>
                2d: R_X86_64_PC32   .data+0x4
      31:   01 d0                   add    %edx,%eax
      33:   c9                      leaveq
      34:   c3                      retq
    

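    Note how the relocations at offsets 0x22 and 0x2d don't say anything about the names (a_static and a_function_static.1800 respectively): each refers only to the .data section plus an addend. The addends shown (.data and .data+0x4) are 4 less than the variables' offsets in .data (0x4 and 0x8) because R_X86_64_PC32 is computed relative to the relocated 4-byte field, while %rip points just past that field when the instruction executes.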