pointers assembly x86-64 nasm reverse-engineering

Data always being stored in the same address in elf64 NASM?

I wrote a simple Hello world program in NASM, to then look at using objdump -d out of curiosity. The program is as follows:

BITS 64
SECTION .text
  GLOBAL _start

_start:
  mov rax, 0x01
  mov rdi, 0x00
  mov rsi, hello_world
  mov rdx, hello_world_len
  syscall

  mov rax, 0x3C
  syscall

SECTION .data
  hello_world: db "Hello, world!", 0x0A
  hello_world_len: equ $-hello_world

When I inspected this program, I found that the actual implementation of this uses movabs with the hex value 0x402000 in place of a name, which makes sense, except for the fact that surely this would mean that it knows 'Hello, world!' is going to be stored at 0x402000 everytime the program is run, and there is no reference to 'Hello, world!' anywhere in the output of objdump -d hello_world (the output of which I provided below).

I tried rewriting the program; This time I replaced hello_world on line 8 with mov rsi, 0x402000 and the program still compiled and worked perfectly.

I thought maybe it was some encoding of the name, however changing the text 'hello_world' in SECTION .data did not change the outcome either.

I'm more confused than anything - How does it know the address at compile time, and how come it never changes, even on recompilation?

(OUTPUT OF objdump -d hello_world)

./hello_world:   file format elf64-x86-64

Disassembly of section .text:

0000000000401000 <_start>:
  401000: b8 01 00 00 00       mov    $0x1,%eax
  401005: bf 00 00 00 00       mov    $0x0,%edi
  40100a: 48 be 00 20 40 00 00 movabs $0x402000,%rsi
  401011: 00 00 00
  401014: ba 0e 00 00 00       mov    $0xe,%edx
  401019: 0f 05                syscall
  40101b: b8 3c 00 00 00       mov    $0x3c,%eax
  401020: bf 00 00 00 00       syscall

(as you can see, no 'Disassembly of section .data', which further confuses me)

Solution

The string is known at compile time too. It statically exists in your executable. The compiler put it at the address in the first place, so of course it knows the address!

(And in an ASLR or dylib environment this would still apply, because all addresses relative to the module would get shifted as needed and the compiler would put a relocation entry so the loader knows there is an address reference there to fix up, but they would still stay the same relative to each other.)

And this doesn't mean that every program ever existing will have unique memory locations, nor does it mean that all contents of a program have to idly sit around and use up all of your memory even if they are rarely needed, because this is virtual memory.

The address is only meaningful within your own process, and the memory page in question doesn't have to exist in memory physically, it can be paged in and out as needed, and it's the OS' memory manager's job to decide what to keep in physical memory at what times. Attempting to access an address belonging to a page that's not physically in memory will make it transparently get paged in by the kernel at that point in time. But with such a small program, most likely the whole program will be in memory from the start.

In user-mode code, you will generally never see physical memory addresses. This is entirely abstracted away by the kernel.