Tags: assembly, linux-kernel, boot, arm64

Why save init_task struct address to sp_el0 in arm64 boot code __primary_switched?


This is assembly code from Linux arm64 (arch/arm64/kernel/head.S, kernel source 5.4.21).

__primary_switched:
    adrp    x4, init_thread_union       // line 1
    add     sp, x4, #THREAD_SIZE        // line 2
    adr_l   x5, init_task               // line 3
    msr     sp_el0, x5                  // line 4: Save thread_info
    adr_l   x8, vectors                 // line 5: load VBAR_EL1 with virtual
    msr     vbar_el1, x8                // line 6: vector table address
    isb                                 // line 7

    stp     xzr, x30, [sp, #-16]!       // line 8
    mov     x29, sp                     // line 9

    str_l   x21, __fdt_pointer, x5      // line 10: Save FDT pointer

I'll try to explain it. Please shed some light on this and correct me if I'm wrong.

  • line 1 : x4 = (page address of init_thread_union). I found that init_thread_union is a symbol defined in the kernel linker script (arch/arm64/kernel/vmlinux.lds). This vmlinux.lds is generated from vmlinux.lds.S during the kernel build.
  • line 2 : sp = (x4 + #THREAD_SIZE). This looks like it sets the stack pointer for this thread, and it looks like this thread uses the init_thread_union memory region as its stack (my guess: 4K bytes starting at init_thread_union).
  • line 3 : x5 = (address of the init_task struct). I found that init_task is the task_struct for the init task (in init/init_task.c), so this struct contains the thread info.
  • line 4 : sp_el0 = x5. Why set the stack pointer of exception level 0 to the thread_info? And is this sp_el0 different from the sp in line 2? (I guess we are now in EL1, so sp in line 2 means sp_el1.) Also, x5 is used again later, in line 10.

I can't understand exactly what this code is doing, especially line 4. What is it doing?


Solution

  • Alright, crash course on stack management on arm64:

    At EL1 you have two stack pointer registers in play: sp_el1 and sp_el0. You also have a register just called sp that you can access in most other instructions like add, str, etc. And then there's another system register called spsel, which consists of a single bit that controls whether sp is an alias of sp_el1 or sp_el0. While spsel is 1, you can additionally read and write sp_el0 with mrs/msr. sp_el1 has no mrs/msr access from EL1 (that encoding is for EL2 and above), so at EL1 you reach it only as sp. To illustrate:

    msr spsel, 1            // sp is now an alias of sp_el1
    movz x1, 0x1000
    movz x2, 0x2000
    msr sp_el0, x1          // allowed because sp is not sp_el0 right now
    mov sp, x2              // writes sp_el1
    msr spsel, 0
    add x3, sp, 0x10
    msr spsel, 1
    add x4, sp, 0x20
    // At this point, x3 == 0x1010 and x4 == 0x2020
    

    In addition, when you're running at EL1 and you eret to EL0, the stack pointer in use will always be sp_el0. The reason for having two is that when you take an exception to EL1 again, your stack pointer is always switched to sp_el1. This is done because every single general-purpose register holds userspace values at that point, and you need a way to save them away without clobbering any of them (or storing to userland memory).
    So what kernels usually do is set up an exception stack in sp_el1 onto which registers can be spilled when taking an exception. When taking an exception from EL1 to EL1 (e.g. an IRQ), it is usually safe to keep using the stack that was in use before the exception was taken, so it is possible to run the kernel itself entirely on sp_el1.
    Some operating systems don't do this, however, and instead add another stack pointer, a "normal kernel stack" if you will. Then the exception flow will look something like this:

    1. Take an exception to EL1; the hardware implicitly switches to sp_el1 and disables interrupts.
    2. Spill all general-purpose registers to the exception stack.
    3. Save the interrupted context's sp_el0 as well, then load sp_el0 with the "normal kernel stack" pointer.
    4. Switch to sp_el0 with msr spsel, 0 and enable interrupts.

    Because exception vectors are different depending on whether you came from a context running on sp_el0 or sp_el1, this allows you to confine "expected exceptions" to sp_el0, and if you ever take an exception while running on sp_el1, you assume you faulted in a critical section and panic.
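
    The flow above can be sketched as a toy model in plain C. This is not kernel code: the struct, the function names and the stack addresses are made up for illustration. The only architectural facts it relies on are the AArch64 vector table layout (groups at 0x000, 0x200 and 0x400 from VBAR_EL1, with sync/IRQ/FIQ/SError 0x80 apart inside each group) and the switch to sp_el1 on exception entry.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets of the four entries inside one group of the EL1 vector table. */
enum exc_type { EXC_SYNC = 0x000, EXC_IRQ = 0x080, EXC_FIQ = 0x100, EXC_SERROR = 0x180 };

/* Toy model of the state discussed above; not kernel code. */
struct cpu {
    int el;          /* current exception level, 0 or 1 */
    int spsel;       /* 0: sp names sp_el0, 1: sp names sp_el1 */
    uint64_t sp_el0;
    uint64_t sp_el1;
};

/* The register that a plain "sp" operand refers to right now. */
static uint64_t *current_sp(struct cpu *c)
{
    return (c->el == 0 || c->spsel == 0) ? &c->sp_el0 : &c->sp_el1;
}

/* Take an exception to EL1: pick the vector (offset from VBAR_EL1),
 * then switch to sp_el1 as the hardware does. */
static unsigned take_exception_to_el1(struct cpu *c, enum exc_type type)
{
    unsigned group;

    if (c->el == 0)
        group = 0x400;      /* came from a lower EL running AArch64 */
    else if (c->spsel == 0)
        group = 0x000;      /* came from EL1 running on sp_el0 */
    else
        group = 0x200;      /* came from EL1 running on sp_el1 */

    c->el = 1;
    c->spsel = 1;
    return group + (unsigned)type;
}

/* Walk through steps 1 to 4 with made-up stack addresses. */
void self_check(void)
{
    struct cpu c = { .el = 0, .spsel = 0, .sp_el0 = 0x1000, .sp_el1 = 0x2000 };

    assert(take_exception_to_el1(&c, EXC_SYNC) == 0x400);  /* step 1 */
    assert(*current_sp(&c) == 0x2000);                     /* on the exception stack */

    c.sp_el0 = 0x3000;                                     /* step 3: "normal" kernel stack */
    c.spsel = 0;                                           /* step 4 */
    assert(*current_sp(&c) == 0x3000);

    assert(take_exception_to_el1(&c, EXC_IRQ) == 0x080);   /* expected: was on sp_el0 */
    assert(take_exception_to_el1(&c, EXC_SYNC) == 0x200);  /* was on sp_el1: the panic case */
}
```

    The last two assertions are the point: the same kind of exception lands in a different vector group depending on which stack pointer was selected, which is what lets a kernel treat "exception while on sp_el1" as fatal.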

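    Now for the code you've shown: it is running with spsel, 1, so lines 1 and 2 point sp (that is, sp_el1) at the top of init_thread_union, which becomes the kernel stack of the boot thread. Lines 3 and 4 do not set up a second stack, though. arm64 Linux runs the kernel itself on sp_el1 and, while at EL1, uses sp_el0 to hold a pointer to the current task_struct: get_current() in arch/arm64/include/asm/current.h simply reads sp_el0. Because thread_info is the first member of task_struct (CONFIG_THREAD_INFO_IN_TASK), the same pointer is also a pointer to the thread_info, hence the "Save thread_info" comment. Userland's own sp_el0 is saved on exception entry and restored before returning to EL0.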
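
    As a rough user-space model of lines 1 to 4, here is a C sketch. It assumes CONFIG_THREAD_INFO_IN_TASK (thread_info as the first member of task_struct) and the arm64 convention that the kernel reads current back out of sp_el0 while at EL1. The trimmed structs, the THREAD_SIZE value, the two plain variables standing in for registers and the *_model function names are all stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define THREAD_SIZE 16384  /* assumed value; the real one depends on the kernel config */

/* Heavily trimmed stand-ins for the kernel types. */
struct thread_info { unsigned long flags; };
struct task_struct {
    struct thread_info thread_info;  /* first member, as THREAD_INFO_IN_TASK requires */
    void *stack;
};
union thread_union { unsigned long stack[THREAD_SIZE / sizeof(unsigned long)]; };

static union thread_union init_thread_union;
static struct task_struct init_task = { .stack = &init_thread_union };

/* Plain variables standing in for the two registers. */
static uintptr_t sp, sp_el0;

static void primary_switched_model(void)
{
    sp = (uintptr_t)&init_thread_union + THREAD_SIZE;  /* lines 1-2: top of the stack */
    sp_el0 = (uintptr_t)&init_task;                    /* lines 3-4 */
}

static struct task_struct *get_current_model(void)
{
    return (struct task_struct *)sp_el0;               /* the kernel does: mrs %0, sp_el0 */
}

void self_check(void)
{
    primary_switched_model();
    assert(get_current_model() == &init_task);
    /* Same address, so the pointer doubles as a thread_info pointer. */
    assert((void *)get_current_model() == (void *)&init_task.thread_info);
    assert(offsetof(struct task_struct, thread_info) == 0);
    /* The stack grows down from one THREAD_SIZE above its base. */
    assert(sp - (uintptr_t)init_task.stack == THREAD_SIZE);
}
```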

    Also note that str_l isn't an actual instruction, but a Linux-specific macro:

    /*
     * @src: source register (32 or 64 bit wide)
     * @sym: name of the symbol
     * @tmp: mandatory 64-bit scratch register to calculate the address
     *       while <src> needs to be preserved.
     */
    .macro  str_l, src, sym, tmp
    adrp    \tmp, \sym
    str \src, [\tmp, :lo12:\sym]
    .endm
    

    So the code it would generate is:

    adrp x5, __fdt_pointer
    str x21, [x5, :lo12:__fdt_pointer]
    

    Which, as you can see, doesn't use the old value of x5: x5 is only the scratch register here, and adrp overwrites it.
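
    If the adrp / :lo12: pair is unfamiliar, the address arithmetic can be sketched in C. adrp leaves the base of the 4 KiB page containing the symbol in the register, and :lo12: supplies the offset within that page. The address below is hypothetical, picked only to make the split visible, and the helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Value adrp leaves in the register: base of the 4 KiB page holding sym. */
static uint64_t adrp_page(uint64_t sym)
{
    return sym & ~UINT64_C(0xfff);
}

/* Value the :lo12: operator contributes: offset of sym within that page. */
static uint64_t lo12(uint64_t sym)
{
    return sym & UINT64_C(0xfff);
}

void self_check(void)
{
    uint64_t fdt_pointer_addr = UINT64_C(0xffff000011a2b3c8);  /* hypothetical address */

    uint64_t x5 = adrp_page(fdt_pointer_addr);           /* adrp x5, __fdt_pointer */
    assert(x5 == UINT64_C(0xffff000011a2b000));
    assert(lo12(fdt_pointer_addr) == 0x3c8);

    /* str x21, [x5, :lo12:__fdt_pointer] stores to page base + low 12 bits. */
    assert(x5 + lo12(fdt_pointer_addr) == fdt_pointer_addr);
}
```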