Which context needs to be saved on x86-64 when a thread switch happens at a C function return?

I'm writing a toy OS. I know that when switching threads, we need to save the thread's context so that the application cannot notice that a switch occurred. This includes all the xmm, ymm, and zmm registers, which makes it very heavy.

But in one case, the thread switch happens at the return of a C function, for example thrd_yield(). According to the x86 calling conventions, a callee C function doesn't need to preserve xmm, ymm, or zmm. If the implementation of thrd_yield() doesn't use xmm, ymm, or zmm either, we don't need to save them.

What I want to ask is: under these circumstances, which registers do I need to save? All I can think of is:

registers that thrd_yield() itself uses
callee-saved registers that thrd_yield() does not use
the RFLAGS register (other threads may not be C programs and may not follow the C calling convention, e.g. they may change the direction flag)
the MXCSR register

Assume that I'm using the System V ABI. Are there any other registers I need to save, and how should I save them? Can you give me some advice? Thanks very much!

Solution

If your context-switch function looks like a normal function call to its callers (one which just eventually returns much later), then compiler-generated C code that calls it will Just Work, as long as the function itself saves the call-preserved registers from its caller's context and restores those registers from the new context. To the caller, it then looks like a function call that follows the standard calling convention.

For x86-64 System V, that's only RSP, RBP, RBX, and R12-R15.
Everything else is call-clobbered, like RFLAGS, all the vector regs, AVX-512 mask regs, and x87 st0..7.
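As a rough illustration of the point above, here is a minimal sketch of such a switch function, assuming GCC or clang targeting x86-64 System V on an ELF platform. The names ctx_switch, ctx_init, struct context, worker and demo are all hypothetical, not from any real kernel. The call-preserved registers are pushed on the old thread's stack, so the saved context is just one stack pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical context: RBP, RBX and R12-R15 live on the thread's own
 * stack, so only the stack pointer needs a slot here. */
struct context { void *rsp; };

/* rdi = context to save into, rsi = context to resume (SysV argument regs). */
void ctx_switch(struct context *from, struct context *to);

__asm__(
    ".pushsection .text\n"
    ".globl ctx_switch\n"
    ".type ctx_switch, @function\n"
    "ctx_switch:\n"
    "    pushq %rbp\n"          /* save the caller's call-preserved regs */
    "    pushq %rbx\n"
    "    pushq %r12\n"
    "    pushq %r13\n"
    "    pushq %r14\n"
    "    pushq %r15\n"
    "    movq %rsp, (%rdi)\n"   /* from->rsp = current stack pointer */
    "    movq (%rsi), %rsp\n"   /* switch to the new thread's stack */
    "    popq %r15\n"           /* restore the new thread's regs */
    "    popq %r14\n"
    "    popq %r13\n"
    "    popq %r12\n"
    "    popq %rbx\n"
    "    popq %rbp\n"
    "    ret\n"                 /* return into the new thread */
    ".size ctx_switch, .-ctx_switch\n"
    ".popsection\n"
);

/* Build an initial stack frame so the first switch "returns" into entry.
 * entry must never return: its fake return address is 0. */
void ctx_init(struct context *ctx, void *stack_top, void (*entry)(void))
{
    uint64_t *sp = (uint64_t *)((uintptr_t)stack_top & ~(uintptr_t)15);
    *--sp = 0;                            /* fake return address, keeps ABI alignment */
    *--sp = (uint64_t)(uintptr_t)entry;   /* popped by the ret in ctx_switch */
    for (int i = 0; i < 6; i++)
        *--sp = 0;                        /* initial RBP, RBX, R12-R15 */
    ctx->rsp = sp;
}

/* Tiny demo: one worker thread that "yields" back to the main context. */
static struct context main_ctx, worker_ctx;
static volatile int counter;   /* volatile: modified across switches */

static void worker(void)
{
    for (;;) {
        counter += 10;
        ctx_switch(&worker_ctx, &main_ctx);   /* yield */
    }
}

int demo(void)
{
    static unsigned char stack[16384] __attribute__((aligned(16)));
    counter = 0;
    ctx_init(&worker_ctx, stack + sizeof stack, worker);
    ctx_switch(&main_ctx, &worker_ctx);   /* worker adds 10 */
    counter += 1;
    ctx_switch(&main_ctx, &worker_ctx);   /* worker adds 10 again */
    return counter;
}
```

Calling demo() runs the worker twice and comes back each time as if ctx_switch were an ordinary function that simply took a while to return. A real kernel would typically also build with -mno-red-zone and handle the FP environment, RFLAGS and segment bases separately.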

The status bits in MXCSR are also basically call-clobbered, but if you want different threads to have different FP environments (e.g. rounding mode and FTZ/DAZ), then you do need to save/restore that. Same for the x87 control register, maybe not the status register.
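A minimal sketch of saving and restoring a per-thread FP environment, assuming GCC or clang on x86-64. struct fp_env and the fp_env_* functions are hypothetical names; _mm_getcsr, _mm_setcsr and the _MM_* constants come from xmmintrin.h. For simplicity it restores the whole MXCSR, status bits included, which is harmless since those are call-clobbered anyway.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_getcsr, _mm_setcsr, _MM_ROUND_* */

/* Hypothetical per-thread FP environment: SSE control/status plus the
 * x87 control word. */
struct fp_env {
    uint32_t mxcsr;
    uint16_t x87_cw;
};

static inline void fp_env_save(struct fp_env *e)
{
    e->mxcsr = _mm_getcsr();                            /* stmxcsr */
    __asm__ volatile("fnstcw %0" : "=m"(e->x87_cw));    /* x87 control word */
}

static inline void fp_env_restore(const struct fp_env *e)
{
    _mm_setcsr(e->mxcsr);                               /* ldmxcsr */
    __asm__ volatile("fldcw %0" : : "m"(e->x87_cw));
}

/* Demo: pretend thread B wants round-toward-zero and FTZ, while thread A
 * keeps whatever environment was active on entry. Returns 1 on success. */
int fp_env_demo(void)
{
    struct fp_env a, b;
    fp_env_save(&a);                                    /* thread A's env */

    b = a;
    b.mxcsr = (a.mxcsr & ~(uint32_t)_MM_ROUND_MASK)
            | _MM_ROUND_TOWARD_ZERO | _MM_FLUSH_ZERO_ON;

    fp_env_restore(&b);                                 /* "switch" to B */
    uint32_t in_b = _mm_getcsr();

    fp_env_restore(&a);                                 /* "switch" back to A */
    uint32_t in_a = _mm_getcsr();

    return (in_b & _MM_ROUND_MASK) == _MM_ROUND_TOWARD_ZERO
        && (in_b & _MM_FLUSH_ZERO_ON) != 0
        && (in_a & _MM_ROUND_MASK) == (a.mxcsr & _MM_ROUND_MASK);
}
```

In a real switch function these two saves and restores would sit next to the integer register save, and cost far less than a full xsave.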

MPX is deprecated now, so you probably don't need to worry about bnd0-3. If you want to have per-task performance-counter stuff, you could save/restore the PMU performance counters like Linux does for PAPI / perf.

Thread-local storage using fsbase or gsbase should be saved/restored if your OS or user-space uses it. There are MSRs for the segment bases (so you can leave the actual segment register values as 0, the null selector). Or if you enable it (for use in user-space or kernel) on a CPU that supports it, rdfsbase / wrfsbase can copy the segment base to/from an integer register even more easily and efficiently than rdmsr / wrmsr. (x86-64 SysV uses FS for thread-local storage.)
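A rough sketch of the MSR route, assuming GCC or clang and code running in ring 0 (rdmsr / wrmsr fault in user-space). struct tls_state, rdmsr(), wrmsr(), tls_save() and tls_restore() are hypothetical helper names; the MSR numbers are the architectural IA32_FS_BASE and IA32_GS_BASE. It also assumes a kernel that does not use swapgs; with swapgs, the inactive GS base lives in IA32_KERNEL_GS_BASE (0xC0000102) instead.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_FS_BASE 0xC0000100u   /* IA32_FS_BASE */
#define MSR_GS_BASE 0xC0000101u   /* IA32_GS_BASE */

/* Hypothetical per-thread TLS state: just the two segment bases. */
struct tls_state {
    uint64_t fs_base;
    uint64_t gs_base;
};

/* Privileged: only valid in ring 0. */
static inline uint64_t rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    __asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}

/* Privileged: only valid in ring 0. */
static inline void wrmsr(uint32_t msr, uint64_t value)
{
    __asm__ volatile("wrmsr"
                     :
                     : "c"(msr), "a"((uint32_t)value),
                       "d"((uint32_t)(value >> 32))
                     : "memory");
}

/* Save the outgoing thread's segment bases. */
static inline void tls_save(struct tls_state *t)
{
    t->fs_base = rdmsr(MSR_FS_BASE);
    t->gs_base = rdmsr(MSR_GS_BASE);
}

/* Load the incoming thread's segment bases. */
static inline void tls_restore(const struct tls_state *t)
{
    wrmsr(MSR_FS_BASE, t->fs_base);
    wrmsr(MSR_GS_BASE, t->gs_base);
}
```

If only FS is used for TLS, as in x86-64 SysV user-space, the GS half can be dropped.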

An asm caller should treat call thrd_yield exactly like a call to a compiler-generated function, assuming it clobbers all call-clobbered registers, leaving others unmodified.

For RFLAGS specifically, the x86-64 SysV ABI also requires DF=0 before a call, and guarantees DF=0 after it returns. You could make thrd_yield run a cld instruction to support sloppy callers.

It's normal for everything to leave AC=0, so misaligned loads/stores don't fault. If you want some tasks to be able to set it without corrupting each other, then you'll have to save/restore the AC bit in RFLAGS. You might as well save/restore the whole RFLAGS, since there's no harm in saving/restoring other stuff along with it.
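A small sketch of saving and restoring the whole RFLAGS, assuming GCC or clang on x86-64, where the builtins __builtin_ia32_readeflags_u64 and __builtin_ia32_writeeflags_u64 wrap pushfq / popfq. The rflags_save, rflags_restore and rflags_demo names and the RFLAGS_* macros are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RFLAGS_DF (1ULL << 10)   /* direction flag */
#define RFLAGS_AC (1ULL << 18)   /* alignment check */

/* Read the full RFLAGS (pushfq + pop under the hood). */
static inline uint64_t rflags_save(void)
{
    return __builtin_ia32_readeflags_u64();
}

/* Write the full RFLAGS back (push + popfq under the hood). In ring 0
 * this also restores IF, so interrupts follow the saved state. */
static inline void rflags_restore(uint64_t flags)
{
    __builtin_ia32_writeeflags_u64(flags);
}

/* Demo: the ABI guarantees DF=0 on function entry, and a save/restore
 * round trip leaves DF and AC as they were. Returns 1 on success. */
int rflags_demo(void)
{
    uint64_t saved = rflags_save();
    int df_clear = (saved & RFLAGS_DF) == 0;

    rflags_restore(saved);
    uint64_t again = rflags_save();

    return df_clear
        && ((again ^ saved) & (RFLAGS_DF | RFLAGS_AC)) == 0;
}
```

Only the DF and AC bits are compared in the demo because the arithmetic flags can legitimately change between the two reads.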

Of course, if this thrd_yield() function is ever called from an interrupt handler that might run when user-space has valuable state everywhere (pre-emptive multi-tasking), that's a whole different ball-game.

The way Linux manages it is roughly:

Entering the kernel in the first place saves state of integer registers, on a per-thread kernel stack. This will be restored later when returning to user-space for this task, potentially between any two instructions so it's safe for async interrupts.
The vector regs aren't used by kernel code (unless it calls kernel_fpu_begin() first). So the interrupt and system call entry points don't have to run xsave; that can be deferred until switching to a new user-space task. At which point you do xsave (or xsaveopt or whatever) for the old context, then xrstor to load the new after switching to the new task's kernel stack.
Calling switch_to (see How does schedule()+switch_to() functions from linux kernel actually work?) just switches call-preserved integer registers (of the kernel state of the caller), and saves/restores the user-space FP/SIMD state from the vector regs. (Older kernels used to try to defer this, but modern user-space uses movaps all the time for memcpy and stuff.)
When that new kernel state eventually returns back to the syscall or interrupt entry point that got that task into the kernel, the user-space state will be restored. The call-preserved registers will already have been restored by the kernel C functions, but Linux saves/restores all the registers anyway so debuggers (the ptrace system call) can modify that state all in one place.