memory compilation operating-system executable machine-code

How do memory addresses in binary programs point to the right place in memory at runtime?


From what I understand, when you compile a program (let's say a C program, for example), the compiler takes your code and outputs an executable program in binary format (i.e. machine code for the targeted arch).

Within this binary you're going to have instructions that point to addresses in memory to load data/instructions from other parts of the program.

Given this program will be loaded into memory at some arbitrary location, how does the program know what these memory addresses are? How are they set/calculated, and whose job is it to do this?

For example, does the binary just have placeholders for the memory locations that are replaced by the OS when it loads it into memory for the first time?

If it needs to dynamically load a shared library how does it work out where the memory location is for that?

How does 'virtual memory' come into play with this? (if at all)


Solution

  • how does the program know what these memory addresses are?

    The program (and its author) does not know what the memory address will be when it is loaded into computer memory; it only knows where the placeholder is, relative to the start of its segment. That is why the compiler accompanies each such placeholder with a relocation record. A relocation is a piece of information that tells the linker or the OS:

    1. where the relocated address is (its offset in code or data segment)
    2. which segment it is in
    3. which segment or symbol it refers to
    4. what kind of relocation should be applied to the address
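
    As a rough illustration, the four pieces of information above can be modelled as a small record. This is only a sketch: the class name, the field names and the "abs32" kind are hypothetical and do not match any real object-file format, and the sample values are chosen to agree with the LEA example discussed further down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relocation:
    """Hypothetical relocation record; all names here are illustrative only."""
    offset: int   # 1. where the placeholder is, relative to the start of its segment
    segment: str  # 2. which segment contains the placeholder
    target: str   # 3. which segment or symbol the placeholder refers to
    kind: str     # 4. what kind of fix-up to apply ("abs32" is an assumed name
                  #    for "add a 32-bit absolute virtual address")


# A 32-bit placeholder at offset 3 of .text that refers to the .data segment.
reloc = Relocation(offset=0x3, segment=".text", target=".data", kind="abs32")
print(reloc)
```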

    Consider the following simple piece of source code for a Windows Portable Executable program:

    [.text]
    Main:NOP
         LEA ESI,[Mem]
         ; more instructions 
    [.data]
         DB "Some data"
    Mem: DB "Other data"
    

    which will be converted to machine instructions and memory data:

    |[.text]                   |[.text]
    |00000000:90               |Main:NOP
    |00000001:8D35[09000000]   |     LEA ESI,[Mem]
    |00000007:                 |     ; more instructions
    |[.data]                   |[.data]
    |00000000:536F6D6520646174~|     DB "Some data"
    |00000009:4F74686572206461~|Mem: DB "Other data"
    

    The compiler does not know the virtual address of Mem; it only knows that Mem is located 0x00000009 bytes from the start of the .data segment. It therefore puts this temporary number into the operation code of LEA ESI,[Mem] and creates a relocation for the placeholder (located in segment .text at offset 0x00000003), which is relative to segment .data.

    At link time the linker decides that the .text segment will be loaded at virtual address 0x00401000 and the .data segment at VA 0x00402000. The linker then reads the relocation record and modifies the placeholder by adding 0x00402000. The instruction LEA ESI,[Mem] in the linked executable will then be 8D3509204000; its operand, stored little-endian as 09204000, is 0x00402009, the final fixed-up virtual address of Mem. We will be able to see that address in a debugger at run time.
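
    The link-time fix-up described above can be reproduced in a few lines. This is a minimal sketch, not how any real linker is implemented: the helper name apply_abs32 is made up for illustration, and it handles only the one relocation kind used in this example (add a segment's virtual address to a 32-bit little-endian placeholder).

```python
import struct


def apply_abs32(code: bytearray, offset: int, segment_va: int) -> None:
    """Add segment_va to the 32-bit little-endian placeholder at `offset`.

    Hypothetical helper: it models only the single fix-up from the text.
    """
    (placeholder,) = struct.unpack_from("<I", code, offset)
    struct.pack_into("<I", code, offset, (placeholder + segment_va) & 0xFFFFFFFF)


# .text as emitted by the compiler: NOP, then LEA ESI,[Mem] with placeholder 9.
text = bytearray.fromhex("90" "8D3509000000")
DATA_VA = 0x00402000  # virtual address the linker chose for .data

# The relocation record says: placeholder at .text offset 3, relative to .data.
apply_abs32(text, offset=3, segment_va=DATA_VA)

linked_hex = text.hex().upper()
print(linked_hex)  # 908D3509204000
```

    The last six bytes, 8D3509204000, are the linked LEA instruction from the paragraph above.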

    Relocations are present in linked executable files too (16-bit DOS MZ or Windows PE), for the case where the file cannot be loaded at the virtual image base address assumed at link time. Linking shared object (SO) libraries in Linux is more complicated; see chapter 2, Dynamic Linking, in http://www.skyfree.org/linux/references/ELF_Format.pdf
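
    To illustrate the load-time case, here is a sketch of what rebasing amounts to when the image cannot be placed at its assumed base. It rests on assumptions that are not in the example above: the preferred image base is taken to be 0x00400000 (suggested by .text at 0x00401000), the actual load address 0x00A50000 is invented, and the function name rebase is hypothetical. A real loader reads the fix-up locations from the executable's relocation table instead of a hard-coded list.

```python
import struct

PREFERRED_BASE = 0x00400000  # assumed image base used at link time
ACTUAL_BASE = 0x00A50000     # hypothetical address where the image really landed


def rebase(code: bytearray, offsets, delta: int) -> None:
    """Add `delta` to every 32-bit little-endian address listed in `offsets`."""
    for off in offsets:
        (addr,) = struct.unpack_from("<I", code, off)
        struct.pack_into("<I", code, off, (addr + delta) & 0xFFFFFFFF)


# Linked .text from the example: NOP, then LEA ESI,[0x00402009].
text = bytearray.fromhex("908D3509204000")

# One relocation survives in the executable: the operand at .text offset 3.
rebase(text, [3], ACTUAL_BASE - PREFERRED_BASE)

(new_addr,) = struct.unpack_from("<I", text, 3)
print(hex(new_addr))  # 0xa52009
```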