Tags: linux, security, x86-64, memory-address, aslr

ASLR and memory layout on 64 bits: Is it limited to the canonical part (128 TiB)?


When loading a PIE executable with ASLR enabled, will Linux restrict the mapping of the program segments to the canonical range (up to 0000_7fff_ffff_ffff), or will it use the full lower half of the 64-bit address space (everything with bit 63 clear)?


Solution

  • Obviously Linux won't give your process unusable addresses; that would make it raise a #GP(0) exception (and thus segfault) when it tries to execute code from _start (or, if close to the cutoff, when it tries to load or store .data or .bss).

    That would actually happen on the instruction that tried to set RIP to a non-canonical value in the first place, likely an iret or sysret (see footnote 1 below).
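
    You can watch this from user space: dump the process's own mappings and check that every one of them sits in the low canonical half. A minimal sketch, assuming 4-level paging so the cutoff is 0x0000800000000000 (the constant is just the 128 TiB limit from the table below, not part of any API):

        /* Print any of our own mappings that end above the 128 TiB
         * user-space limit of 4-level paging. With ASLR, the addresses
         * change on every run but should all stay below the cutoff.
         * Exception: the legacy [vsyscall] page, if present, is a fixed
         * kernel-provided mapping up in the high half. */
        #include <stdio.h>

        int main(void) {
            FILE *maps = fopen("/proc/self/maps", "r");
            if (!maps) { perror("fopen"); return 1; }
            char line[512];
            unsigned long start, end;
            while (fgets(line, sizeof line, maps)) {
                if (sscanf(line, "%lx-%lx", &start, &end) == 2 &&
                    end > 0x0000800000000000UL)
                    printf("above the cutoff: %s", line);
            }
            fclose(maps);
            return 0;
        }

    Run it a few times: the randomized segment addresses move around, but bits 63:47 stay zero for everything the kernel hands out to user space.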


    On systems with 48-bit virtual addresses, zero to 0000_7fff_ffff_ffff is the full lower half of virtual address space when represented as a sign-extended 64-bit value.

    On systems where PML5 is supported (and used by the kernel), virtual addresses are 57 bits wide, so zero to 00ff_ffff_ffff_ffff is the low-half canonical range. (A quick way to check this sign-extension rule in code is sketched after the tables below.)

    See https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt - the first row is the user-space range. (It talks about "56 bit" virtual addresses. That's incorrect or misleading: PML5 is 57-bit, an extra full level of page tables with 9 bits per level. So the low half is 56 bits with a 0 in the 57th, and the high half is 56 bits with a 1 in the 57th.)

    ========================================================================================================================
        Start addr    |   Offset   |     End addr     |  Size   | VM area description
    ========================================================================================================================
                      |            |                  |         |
     0000000000000000 |    0       | 00007fffffffffff |  128 TB | user-space virtual memory, different per mm
    __________________|____________|__________________|_________|___________________________________________________________
                      |            |                  |         |
     0000800000000000 | +128    TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
                      |            |                  |         |     virtual memory addresses up to the -128 TB
                      |            |                  |         |     starting offset of kernel mappings.
    __________________|____________|__________________|_________|___________________________________________________________
                                                                |
                                                                | Kernel-space virtual memory, shared between all processes:
    ...
    

    Or for PML5:

     0000000000000000 |    0       | 00ffffffffffffff |   64 PB | user-space virtual memory, different per mm
    __________________|____________|__________________|_________|___________________________________________________________
                      |            |                  |         |
 0100000000000000 |  +64    PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
                      |            |                  |         |     virtual memory addresses up to the -64 PB
                      |            |                  |         |     starting offset of kernel mappings.
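
    To make the sign-extension rule concrete, here's a small sketch (the helper name is made up for illustration) that tests whether a 64-bit value is canonical for a given virtual-address width: 48 for 4-level paging, 57 for PML5:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Canonical = bits [63:vabits-1] are all copies of bit vabits-1,
         * i.e. the value is the sign extension of its low vabits bits. */
        static bool is_canonical(uint64_t addr, int vabits) {
            uint64_t top = addr >> (vabits - 1);    /* bits [63:vabits-1] */
            return top == 0 || top == (UINT64_MAX >> (vabits - 1));
        }

        int main(void) {
            printf("%d\n", is_canonical(0x00007fffffffffff, 48)); /* 1: top of low half */
            printf("%d\n", is_canonical(0x0000800000000000, 48)); /* 0: first non-canonical */
            printf("%d\n", is_canonical(0xffff800000000000, 48)); /* 1: bottom of high half */
            printf("%d\n", is_canonical(0x00ffffffffffffff, 57)); /* 1: top of low half, PML5 */
            return 0;
        }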
    

    Footnote 1:
    As prl points out, this design lets an implementation store RIP values in literally only 48 bits anywhere in the pipeline, except when handling jumps, and when detecting signed overflow in case execution runs off the end of the low half into non-canonical territory. (That may save transistors in every place that has to store a uop, which needs to know its own address.) Contrast this with a design where you could jump / iret to an arbitrary RIP: the #GP(0) exception would then have to push the correct 64-bit non-canonical address, so the CPU would have to remember it temporarily.

    It's also more useful for debugging to see where you jumped from, so it makes sense to design the rule this way; there's no use-case for jumping to a non-canonical address on purpose. (Unlike jumping to an unmapped page, where the #PF exception handler can repair the situation, e.g. by demand paging; for that you want the fault address to be the new RIP.)
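
    If you want to see the difference, a throwaway sketch: calling through a function pointer that holds a non-canonical value faults with #GP(0) on the call itself, which Linux delivers to the process as SIGSEGV. Swap in a canonical-but-unmapped address like 0x1000 and you get a #PF instead (still SIGSEGV from user space's point of view, but with the jump target as the fault address):

        #include <stdio.h>

        int main(void) {
            /* First non-canonical address under 4-level paging. */
            void (*bogus)(void) = (void (*)(void))0x0000800000000000UL;
            puts("jumping to a non-canonical address...");
            bogus();        /* #GP(0) here; RIP never becomes non-canonical */
            puts("never reached");
            return 0;
        }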

    Fun fact: using sysret with a non-canonical RIP on Intel CPUs raises #GP(0) in ring 0 (CPL=0), so RSP isn't switched yet and still points to the user stack. If any other threads existed, this would let them mess with memory the kernel was using as a stack. This is a design flaw in IA-32e, Intel's implementation of x86-64. That's why Linux returns to user space with iret from the syscall entry point if ptrace has been used on the process in the meantime: the saved RIP might have been changed to a non-canonical value. The kernel knows a fresh process will have a safe RIP, so it may actually use the faster sysret to jump to user space.