linux-kernel simd arm64 cpu-registers sve

In the Linux kernel, why zero out task->thread.sve_state when handling an SVE access trap?

In Linux v5.10, when handling an SVE access exception in the do_sve_acc() function, why does the kernel zero out the thread's SVE state?

I think it should not zero out the SVE state before restoring it. Am I right?

https://elixir.bootlin.com/linux/v5.10.205/source/arch/arm64/kernel/fpsimd.c#L513

My doubt is this: when a process takes an SVE trap that is not its first, the SVE state holds the context to restore. After zeroing out the SVE state, what is left to restore?

Solution

This trap can only happen when there was no SVE state, only ASIMD.

System calls are allowed to discard the SVE state and return to FP/ASIMD-only mode for cheaper context switches. From the big block comment I quoted below: "During any syscall, the kernel may optionally¹ clear TIF_SVE and discard the vector state except for the FPSIMD subset."

"Discarding" means no architectural state remains that user-space could expect to read later. It will read zeros for the parts of vector registers above the low 128 bits.

Comments in the file you linked describe the design. When SVE isn't being used, it uses cheaper FP/ASIMD context switching. Many processes won't use SVE at all because it's still pretty new, so it definitely makes sense to have this even on hardware that does support SVE.

Specifically https://elixir.bootlin.com/linux/v5.10.205/source/arch/arm64/kernel/fpsimd.c#L229

/*
 * TIF_SVE controls whether a task can use SVE without trapping while
 * in userspace, and also the way a task's FPSIMD/SVE state is stored
 * in thread_struct.
 *
 * The kernel uses this flag to track whether a user task is actively
 * using SVE, and therefore whether full SVE register state needs to
 * be tracked.  If not, the cheaper FPSIMD context handling code can
 * be used instead of the more costly SVE equivalents.
 *
 *  * TIF_SVE set:
 *
 *    The task can execute SVE instructions while in userspace without
 *    trapping to the kernel.
 *
 *    When stored, Z0-Z31 (incorporating Vn in bits[127:0] or the
 *    corresponding Zn), P0-P15 and FFR are encoded in in
 *    task->thread.sve_state, formatted appropriately for vector
 *    length task->thread.sve_vl.
 *
 *    task->thread.sve_state must point to a valid buffer at least
 *    sve_state_size(task) bytes in size.
 *
 *    During any syscall, the kernel may optionally clear TIF_SVE and
 *    discard the vector state except for the FPSIMD subset.
 *
 *  * TIF_SVE clear:
 *
 *    An attempt by the user task to execute an SVE instruction causes
 *    do_sve_acc() to be called, which does some preparation and then
 *    sets TIF_SVE.
 *
 *    When stored, FPSIMD registers V0-V31 are encoded in
 *    task->thread.uw.fpsimd_state; bits [max : 128] for each of Z0-Z31 are
 *    logically zero but not stored anywhere; P0-P15 and FFR are not
 *    stored and have unspecified values from userspace's point of
 *    view.  For hygiene purposes, the kernel zeroes them on next use,
 *    but userspace is discouraged from relying on this.
 *
 *    task->thread.sve_state does not need to be non-NULL, valid or any
 *    particular size: it must not be dereferenced.
 *
 *  * FPSR and FPCR are always stored in task->thread.uw.fpsimd_state
 *    irrespective of whether TIF_SVE is clear or set, since these are
 *    not vector length dependent.
 */

The key part is "bits [max : 128] for each of Z0-Z31 are logically zero but not stored anywhere" in the TIF_SVE-clear state. That's why the kernel zeroes the buffer when leaving that state.
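To make that concrete, here is a toy user-space model of the two transitions the comment describes. It is only a sketch under simplifying assumptions, not kernel code: `struct model_task`, `model_syscall_discard()` and `model_sve_access_trap()` are names I made up, the model tracks only Z0-Z31 (no P0-P15 or FFR), and it pretends register state always lives in memory rather than in the CPU.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NUM_VREGS    32
#define FPSIMD_BYTES 16  /* each V register is 128 bits */

/* Hypothetical, simplified stand-in for per-task state (not the kernel's struct). */
struct model_task {
    uint8_t  fpsimd[NUM_VREGS][FPSIMD_BYTES]; /* V0-V31, always tracked */
    uint8_t *sve_state; /* Z0-Z31, vl bytes each; contents are stale when tif_sve == 0 */
    size_t   vl;        /* vector length in bytes, a multiple of 16 */
    int      tif_sve;   /* models the TIF_SVE thread flag */
};

/* Models a syscall discarding SVE state: keep only the low 128 bits of each
 * register and clear the flag. The old buffer is left behind untouched. */
static void model_syscall_discard(struct model_task *t)
{
    if (!t->tif_sve)
        return;
    for (int i = 0; i < NUM_VREGS; i++)
        memcpy(t->fpsimd[i], t->sve_state + (size_t)i * t->vl, FPSIMD_BYTES);
    t->tif_sve = 0;
}

/* Models the SVE access trap: get a zeroed buffer (allocating on first use,
 * re-zeroing a stale one otherwise), migrate the FPSIMD registers into the
 * low 128 bits of each Z register, then set the flag. Returns 0 on success. */
static int model_sve_access_trap(struct model_task *t)
{
    size_t size = NUM_VREGS * t->vl;

    assert(!t->tif_sve);            /* the trap is only possible with the flag clear */
    assert(t->vl >= FPSIMD_BYTES);

    if (t->sve_state) {
        memset(t->sve_state, 0, size);  /* stale contents must not become visible */
    } else {
        t->sve_state = calloc(1, size);
        if (!t->sve_state)
            return -1;
    }
    for (int i = 0; i < NUM_VREGS; i++)
        memcpy(t->sve_state + (size_t)i * t->vl, t->fpsimd[i], FPSIMD_BYTES);
    t->tif_sve = 1;
    return 0;
}
```

In this model, whatever was in the buffer from an earlier SVE phase is never "restored": by the time the trap handler runs, the flag is clear, so the only live state is the FPSIMD subset, and the upper bits are defined to be zero.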

The same block comment also says:

An attempt by the user task to execute an SVE instruction causes do_sve_acc() to be called, which does some preparation and then sets TIF_SVE.

Before that point, the SVE registers might hold stale garbage from another process, so the kernel has to zero them to prevent data leaks. It's the same reason fresh pages from mmap(MAP_ANONYMOUS) are zeroed.

It's also the same reason execve zeroes the integer registers (and non-SVE SIMD registers). The ABI allows garbage, but for security the kernel chooses fixed values, and zero is convenient.

Other comments like https://elixir.bootlin.com/linux/v5.10.205/source/arch/arm64/kernel/fpsimd.c#L501 say similar things. This looks like lazy init of SVE state, on the assumption that most processes won't use SVE at all. So instead of always allocating and zeroing space for it on execve, the kernel does it on first use in the task, which triggers this trap.

I know Linux used to do lazy context-switching for x86 SIMD/FP state, but now only does "eager" context switching, and has for so long that support for "lazy" has been dropped. On x86-64, pretty much every process will use some SSE instructions in compiler generated code and in library functions like strlen and memcpy, so most timeslices would involve a trap if the kernel left vector/FP instructions disabled.

It looks like the same holds for AArch64 FP/ASIMD: only eager switching is implemented. do_fpsimd_acc is a stub that just warns, because nothing ever calls it. (The kernel still tries to avoid unnecessary swaps when just changing current context inside the kernel, only restoring when actually returning to user-space if the values in regs aren't the values for the user-space context we're returning to. But it doesn't leave ASIMD instructions set to trap on first use.)

AArch64 SVE on the other hand is quite new, and not widely available, so many programs might not use it at all. (Unless libc detects and uses it.) This isn't lazy context-switching for SVE for processes that are using it, only lazy init on first use. (Or on use after a system call if the kernel guessed that it might be done with SVE for a while.)

All the comments are consistent with the idea that do_sve_acc is only called to migrate state from FP/ASIMD to SVE on the first use of SVE, when there is no existing state. e.g. before do_sve_acc itself:

/*
 * Trapped SVE access
 *
 * Storage is allocated for the full SVE state, the current FPSIMD
 * register contents are migrated across, and TIF_SVE is set so that
 * the SVE access trap will be disabled the next time this task
 * reaches ret_to_user.
 *
 * TIF_SVE should be clear on entry: otherwise, fpsimd_restore_current_state()
 * would have disabled the SVE access trap for userspace during
 * ret_to_user, making an SVE access trap impossible in that case.
 */
void do_sve_acc(unsigned int esr, struct pt_regs *regs)
{
...

Footnote 1: "may optionally"?

I don't know if there's a heuristic or if in practice it always chooses to reset back to FP/ASIMD whenever possible. Having syscall-clobbered extended vector regs seems like a good design; most vectorized code wants lots of big vector regs for loops that don't involve system calls or function calls, maybe keeping some vector constants around between loops in the same function but usually without a system call in between. In a rare function that did make a system call between loops, perhaps futex for synchronization, code would have to assume SVE regs were destroyed, so either reload SVE constants or only vectorize with ASIMD.

The standard AArch64 calling convention does have some call-preserved vector regs (the low 64 bits of v8-v15), allowing some scalar FP values to stay in registers across user-level function calls, too. (e.g. https://godbolt.org/z/vrsn5n6d7). But I'm assuming the upper parts (SVE state beyond 128 bits) are call-clobbered, so even if the kernel didn't clear uppers on system calls, you could only take advantage of it by manually inlining a system call in asm, not by letting a C compiler call a libc wrapper function that follows the user-space calling convention.

So the cost is the time it takes to trap the next time SVE is used. The kernel might try to notice that a process is frequently causing these traps, and/or that it is rarely context-switched while in FP/ASIMD-only state, and decide not to reset back to FP/ASIMD state on future system calls. That could avoid the worst-case situations for an always-reset strategy.

Many processes never use SVE, so FP/ASIMD-only context switching is a pure win for them, with no traps. (But they never need a reset from SVE to FP/ASIMD either.) Threads that don't make system calls and spend all their time doing SVE number crunching won't take any of these traps.

An adaptive strategy would be good for threads that have some phases of heavy use of SVE, but other phases of not using SVE, like running non-vectorized code or only ASIMD. (Like perhaps the SVE code was only in a library function, and other phases of computation don't use that library.) But only if they have a system call between phases. For threads that's probably not rare if they sleep and wait for notification from other threads. And in fact right as a thread goes to sleep is a great place to clear its SVE state, unless it's about to use SVE when it wakes up.
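To illustrate the kind of adaptive policy I have in mind, here is a purely hypothetical sketch. Nothing here is taken from Linux: `struct sve_policy`, `policy_on_trap()`, `policy_on_syscall()` and the constants are all invented for this example. Each SVE access trap raises a saturating score, and once the score reaches a threshold the policy keeps SVE state across syscalls, decaying the score each time so that discarding is eventually tried again.

```c
#include <assert.h>
#include <stdbool.h>

#define TRAP_WEIGHT 2   /* hypothetical: how much one trap raises the score */
#define SCORE_MAX   16  /* hypothetical: saturation limit for the score */

/* Hypothetical per-task bookkeeping; not a real kernel structure. */
struct sve_policy {
    unsigned score;      /* rises on SVE access traps, decays while state is kept */
    unsigned threshold;  /* at or above this, keep SVE state across syscalls */
};

/* Call when the task takes an SVE access trap after a discard. */
static void policy_on_trap(struct sve_policy *p)
{
    p->score += TRAP_WEIGHT;
    if (p->score > SCORE_MAX)
        p->score = SCORE_MAX;
}

/* Call on syscall entry while the task has live SVE state.
 * Returns true if the SVE state should be discarded. */
static bool policy_on_syscall(struct sve_policy *p)
{
    assert(p->threshold > 0);
    if (p->score < p->threshold)
        return true;   /* traps have been rare: discarding is probably cheap */
    p->score--;        /* keep the state, but decay so discarding is retried later */
    return false;
}
```

With a threshold of 4, a task that traps twice in a row gets its state kept on the next syscall, and the policy goes back to discarding on the one after that unless more traps arrive. Whether anything like this would beat always discarding is exactly the sort of question that needs profiling data.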

I'm just speculating here; hopefully there's some profiling data to back up whatever strategy Linux actually uses. It may change over time if glibc starts using SVE in strlen and memcmp for example, so more tasks will use SVE every timeslice. (If they don't do that already?)