Tags: linux-kernel, x86, operating-system, kernel, interrupt

When is the kernel stack's ESP stored in the TSS for the interrupt return (iret)?


Reading Intel's x86 programmer's manual, I found the following description of interrupts and interrupt returns that involve a stack switch:

On interrupt:

If a stack switch does occur, the processor does the following:

  1. Temporarily saves (internally) the current contents of the SS, ESP, EFLAGS, CS, and EIP registers.
  2. Loads the segment selector and stack pointer for the new stack (that is, the stack for the privilege level being called) from the TSS into the SS and ESP registers and switches to the new stack.
  3. Pushes the temporarily saved SS, ESP, EFLAGS, CS, and EIP values for the interrupted procedure’s stack onto the new stack.
  4. Pushes an error code on the new stack (if appropriate).
  5. Loads the segment selector for the new code segment and the new instruction pointer (from the interrupt gate or trap gate) into the CS and EIP registers, respectively.
  6. If the call is through an interrupt gate, clears the IF flag in the EFLAGS register.
  7. Begins execution of the handler procedure at the new privilege level.
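
The entry sequence above can be sketched as a small simulation. This is only a toy model under stated assumptions: `toy_cpu`, `toy_tss`, `toy_push`, and `toy_interrupt_entry` are hypothetical names, the kernel stack is a plain array indexed by byte offset, and the selector values used with it are arbitrary. The point it illustrates is that the TSS is only read during entry; nothing is written back to it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model: names and layout are illustrative only. */
#define STACK_WORDS 64u
#define EFLAGS_IF   0x200u   /* interrupt-enable flag, bit 9 of EFLAGS */

struct toy_cpu {
    uint32_t ss, esp, eflags, cs, eip;
    uint32_t kstack[STACK_WORDS];   /* simulated kernel stack memory */
};

struct toy_tss {
    uint32_t ss0;    /* ring-0 stack segment selector */
    uint32_t esp0;   /* ring-0 stack pointer, here a byte offset into kstack */
};

/* Push one 32-bit value; the stack grows toward lower addresses. */
static void toy_push(struct toy_cpu *cpu, uint32_t value)
{
    assert(cpu->esp >= 4 && cpu->esp <= STACK_WORDS * 4 && cpu->esp % 4 == 0);
    cpu->esp -= 4;
    cpu->kstack[cpu->esp / 4] = value;
}

/* Steps 1-6 of the entry sequence for an interrupt that switches stacks. */
void toy_interrupt_entry(struct toy_cpu *cpu, const struct toy_tss *tss,
                         uint32_t gate_cs, uint32_t gate_eip,
                         bool interrupt_gate,
                         bool has_error_code, uint32_t error_code)
{
    /* 1. Save the interrupted context internally. */
    uint32_t old_ss = cpu->ss, old_esp = cpu->esp;
    uint32_t old_eflags = cpu->eflags, old_cs = cpu->cs, old_eip = cpu->eip;

    /* 2. Load the new stack from the TSS (the TSS is read, never written). */
    cpu->ss = tss->ss0;
    cpu->esp = tss->esp0;

    /* 3. Push the saved context onto the new stack. */
    toy_push(cpu, old_ss);
    toy_push(cpu, old_esp);
    toy_push(cpu, old_eflags);
    toy_push(cpu, old_cs);
    toy_push(cpu, old_eip);

    /* 4. Push an error code if this vector supplies one. */
    if (has_error_code)
        toy_push(cpu, error_code);

    /* 5. Load the handler's CS:EIP from the gate. */
    cpu->cs = gate_cs;
    cpu->eip = gate_eip;

    /* 6. Interrupt gates clear IF; trap gates leave it alone. */
    if (interrupt_gate)
        cpu->eflags &= ~EFLAGS_IF;
}
```

After a call with `esp0` set to the top of `kstack`, the five saved values sit just below that top, with the old EIP at the lowest address, and `tss.esp0` still holds its original value.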

On return:

  1. Performs a privilege check.
  2. Restores the CS and EIP registers to their values prior to the interrupt or exception.
  3. Restores the EFLAGS register.
  4. Restores the SS and ESP registers to their values prior to the interrupt or exception, resulting in a stack switch back to the stack of the interrupted procedure.
  5. Resumes execution of the interrupted procedure.
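
The return sequence can be sketched the same way. Again this is a toy model, not a description of the real microarchitecture: `iret_cpu`, `iret_frame`, `selector_rpl`, and `toy_iret` are hypothetical names, and a failed privilege check is reported with a return value where real hardware would raise a fault. Note that the function takes no TSS argument at all, which matches the observation that iret leaves the TSS untouched.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model: names are illustrative only. */
struct iret_cpu {
    uint32_t ss, esp, eflags, cs, eip;
};

/* The frame as it sits on the kernel stack, lowest address first. */
struct iret_frame {
    uint32_t eip, cs, eflags, esp, ss;
};

/* The privilege level is carried in the low two bits of a selector. */
uint32_t selector_rpl(uint32_t selector)
{
    return selector & 3u;
}

/* Returns 0 on success, -1 if the privilege check fails. */
int toy_iret(struct iret_cpu *cpu, const struct iret_frame *frame)
{
    assert(cpu && frame);

    uint32_t cpl = selector_rpl(cpu->cs);
    uint32_t rpl = selector_rpl(frame->cs);

    /* 1. Privilege check: iret may keep or lower privilege, never raise it
     *    (a numerically smaller level is more privileged). */
    if (rpl < cpl)
        return -1;

    /* 2. Restore CS:EIP. */
    cpu->cs = frame->cs;
    cpu->eip = frame->eip;

    /* 3. Restore EFLAGS. */
    cpu->eflags = frame->eflags;

    /* 4. Restore SS:ESP only when returning to a less privileged level;
     *    a same-level return just pops EIP, CS, and EFLAGS. */
    if (rpl > cpl) {
        cpu->ss = frame->ss;
        cpu->esp = frame->esp;
    } else {
        cpu->esp += 12;
    }

    /* 5. Execution resumes at the restored CS:EIP. */
    return 0;
}
```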

For example, consider a Linux process P:

  1. It is initially in kernel mode.
  2. It returns to user mode with iret. According to the manual, this does not modify the TSS.
  3. It traps into the kernel with int. At this point the processor has to find the kernel stack from the SS and ESP fields in the TSS. How were those fields set up, given that nothing was stored to the TSS in step 2?

Solution

  • Once the kernel returns to user-space for a given task, it's done with that task's kernel stack until the next interrupt / exception. There's no useful data on it, so the TSS can hold a fixed SS:[ER]SP value that points to the top of the virtual page[s] allocated as the kernel stack for the current task.

    Kernel state doesn't live on the kernel stack between entries into the kernel; it's kept elsewhere in a process control block. (Context switches between tasks actually happen in the kernel, switching kernel stacks to the formerly-sleeping task's kernel stack, at which point the TSS entry has to be updated to point at the new task's stack. So eventually returning to user-space means returning up the call-chain of whatever that task was doing in the kernel first.)

    BTW, unless the kernel pushes a new CS:EIP / EFLAGS / SS:ESP for iret to pop, the stuff it pops will be the stuff pushed by hardware at the address specified in the TSS. So even if there was some desire to re-enter the kernel with the stack as you left it, that would normally be at the TSS location anyway. But this is irrelevant because Linux doesn't keep stuff on a task's kernel stack while user-space is running, except for a pointer to per-task stuff at the bottom of the region where the kernel can find it with [ER]SP & -16384.

    (I think this is right; I've looked at a few bits of Linux kernel code but haven't really gotten my hands dirty experimenting with things. I think this is how Linux works, and a consistent viable design.)
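
A minimal sketch of that design, under the same caveat that it has not been checked against real kernel code: every name here (`toy_task`, `toy_tss`, `toy_switch_to`, `toy_stack_base_from_sp`) is hypothetical, the 16384-byte region size is taken from the `-16384` mask above, and the assumption is that the kernel rewrites the TSS stack pointer only when it switches tasks.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed size of one task's kernel stack region, matching the -16384 mask. */
#define TOY_THREAD_SIZE 16384u

/* Hypothetical per-task record; a real kernel keeps far more state. */
struct toy_task {
    uintptr_t stack_base;   /* lowest address of this task's kernel stack */
};

/* Only the ring-0 stack pointer field matters for this sketch. */
struct toy_tss {
    uintptr_t sp0;
};

/* The fixed re-entry point: the top of the task's kernel stack region. */
uintptr_t toy_stack_top(const struct toy_task *task)
{
    return task->stack_base + TOY_THREAD_SIZE;
}

/* On a context switch, point the TSS at the next task's stack top.
 * Nothing needs to be stored on return to user space, because the
 * stack is empty again by then. */
void toy_switch_to(struct toy_tss *tss, const struct toy_task *next)
{
    assert(next->stack_base % TOY_THREAD_SIZE == 0);   /* region is aligned */
    tss->sp0 = toy_stack_top(next);
}

/* Round a stack pointer inside the region down to the region's base,
 * the same computation as sp & -16384. This is valid once at least one
 * value has been pushed, so that sp lies strictly below the top. */
uintptr_t toy_stack_base_from_sp(uintptr_t sp)
{
    return sp & ~(uintptr_t)(TOY_THREAD_SIZE - 1);
}
```

With this layout, every entry from user space starts at the same address for a given task, and any code running on that stack can recover the region base (where the per-task pointer would live) from the stack pointer alone.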