Cortex-M4F lazy FPU stacking

I'm writing threading code for a Cortex-M4F. Everything's working and I'm now looking into making FPU context switching more efficient via lazy stacking.

I've read ARM's AN298 and implemented the alternative approach based on disabling the FPU and handling UsageFault, but the lower (S0-S15) registers are not being saved/restored correctly by the hardware. I think the problem lies in figure 11:

According to this, when PendSV runs FPCAR should point to the space reserved in Task A's stack. But as I see it, since CONTROL.FPCA is high in Task C, FPCAR will be updated to point to Task C's stack when entering PendSV. If so, S0-S15 and FPSCR will be saved to Task C's stack instead of Task A's, which is of course not correct.
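To make the concern concrete, here is a toy host-side C model of exception entry as described above. It is only a sketch: the names `fpu_model_t` and `model_exception_entry` are made up for illustration, it assumes ASPEN and LSPEN are both set, and it assumes the usual 26-word extended frame with the S0 slot directly above the 8-word basic frame. It touches no real registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: nothing here accesses real hardware. */
enum {
    BASIC_FRAME_BYTES    = 0x20, /* R0-R3, R12, LR, PC, xPSR */
    EXTENDED_FRAME_BYTES = 0x68  /* basic frame + S0-S15 + FPSCR + reserved word */
};

typedef struct {
    uint32_t fpcar;  /* address of the S0 slot in the latest extended frame */
    bool     lspact; /* lazy state preservation pending */
    bool     fpca;   /* CONTROL.FPCA of the interrupted context */
} fpu_model_t;

/* Model exception entry and return the new stack pointer. */
static uint32_t model_exception_entry(fpu_model_t *m, uint32_t sp)
{
    assert((sp & 7u) == 0); /* simplification: SP is already 8-byte aligned */
    if (m->fpca) {
        sp -= EXTENDED_FRAME_BYTES;         /* space is reserved, not written */
        m->fpcar  = sp + BASIC_FRAME_BYTES; /* points into the interrupted stack */
        m->lspact = true;
        m->fpca   = false;                  /* handler starts without FP context */
    } else {
        sp -= BASIC_FRAME_BYTES;
    }
    return sp;
}
```

In this model, whichever task is interrupted with FPCA set ends up with FPCAR pointing into its own stack, so entering PendSV from Task C moves FPCAR away from Task A's reserved space.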

Am I missing something here, or is the appnote wrong?

On a side note, I checked some open source RTOSes. FreeRTOS and mbed RTOS always stack S16-S31 during the context switch, resulting in automatic S0-S15 stacking, i.e. they make use of lazy stacking only to reduce interrupt latency but do full state preservation for tasks (as in the first approach outlined in the appnote). The TNKernel port for M4F uses the UsageFault approach, but fully saves/restores S0-S31 via software, effectively bypassing any problem with FPCAR (at the cost of 48 load/stores instead of 32, since the 16 hardware ones get overwritten on restore). Nobody seems to be using the UsageFault approach while only preserving S16-S31.

(By the way, this is also posted at ARM Community, but a lot of questions seem to go unanswered there. If I get an answer there, I'll replicate it here, too.)

Solution

It took a while, but in the end I found out how to do this as efficiently as possible.

First off, the appnote is wrong. My initial explanation of the way FPCAR is updated is right. Note that FPCAR is updated even when the FPU is disabled. Also, by testing, I determined that FPCAR does indeed always point to the interrupted stack.

My first approach was to manipulate FPCAR, LSPACT and EXC_RETURN, with the UsageFault handler pending PendSV. Of course, for this to work it's essential that writing FPCAR doesn't count as an FPU operation from a lazy stacking perspective. When the documentation is lacking, we can only hack the answers out of the CPU...

LDR  R2, =0xE000EF38
LDR  R3, =0xDEADBEEF
STR  R3, [R2]
VSTM R1, {S16-S31}
UDF

FPCAR is at 0xE000EF38, and VSTM is part of the context-saving routine. The idea is that, if FPCAR manipulation counts as an FPU op, lazy stacking will happen at the FPCAR store and succeed, since FPCAR is still valid at that point; execution then faults on UDF. Otherwise, lazy stacking will happen at VSTM with a corrupted FPCAR, resulting in a bus fault.

Indeed, I got a bus fault. Yay! I repeated the test with a valid address: no fault, works perfectly. So saving is simple enough. Restoring requires pending PendSV and manipulating FPCAR, LSPACT and EXC_RETURN inside it to cause S0-S15 for the current thread to be restored on exception return. The problem here is that you can't keep state for the current thread on its stack, as it's going to be popped off. Copying is inefficient, so the best bet is to point FPCAR to the persistent TCB state instead of saving the CPU-generated one.
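For anyone trying this route, the bit-level pieces look roughly like the sketch below. It is written as pure functions on register values so it can be checked on a host; on the target the results would be written through volatile pointers to FPCCR (0xE000EF34) and FPCAR (0xE000EF38), and the EXC_RETURN value would go back into LR. The helper names are hypothetical, and the 8-byte alignment check reflects the fact that FPCAR only holds address bits [31:3].

```c
#include <assert.h>
#include <stdint.h>

#define FPCCR_ADDR       0xE000EF34u /* FP context control register */
#define FPCAR_ADDR       0xE000EF38u /* FP context address register */
#define FPCCR_LSPACT     (1u << 0)   /* lazy state preservation active */
#define EXC_RETURN_FTYPE (1u << 4)   /* 0 = extended (FP) frame, 1 = basic frame */

/* Return the FPCCR value with LSPACT set or cleared, other bits untouched. */
static uint32_t fpccr_with_lspact(uint32_t fpccr, int pending)
{
    return pending ? (fpccr | FPCCR_LSPACT) : (fpccr & ~FPCCR_LSPACT);
}

/* Return the value to write to FPCAR for a given save area address. */
static uint32_t fpcar_value_for(uint32_t save_area_addr)
{
    assert((save_area_addr & 7u) == 0); /* FPCAR[2:0] are reserved */
    return save_area_addr;
}

/* Return an EXC_RETURN value that selects the extended (FP) stack frame. */
static uint32_t exc_return_with_fp_frame(uint32_t exc_return)
{
    return exc_return & ~EXC_RETURN_FTYPE;
}
```

For example, `exc_return_with_fp_frame(0xFFFFFFFD)` turns a "return to thread mode on PSP with a basic frame" value into its extended-frame counterpart, 0xFFFFFFED.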

This is getting quite complex: it requires performing a PendSV after the UsageFault, and it has quite a few corner cases and races. There's a better way.

The approach I ended up using runs completely inside UsageFault and bypasses hardware stacking, without losing efficiency over it. After enabling the FPU and determining an FPU context switch is required, I:

Set LSPACT to zero;
Save/restore the full S0-S31 state to/from the TCB;
Set LSPACT back to one.

By doing this, I can work on the whole S0-S31 state without lazy stacking getting in the way, because the CPU thinks it has already stacked the context since LSPACT is zero. This of course relies on the UsageFault handler not using FPU ops outside of save/restore and not being preempted by FPU-using ISRs, which are pretty trivial assumptions given it's hand-coded ASM and fault handlers can't be preempted by ISRs. I also tried disabling lazy stacking via ASPEN/LSPEN instead of working on LSPACT, but it doesn't seem to work (it still triggers lazy stacking, verified by setting an invalid FPCAR).
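The sequence can be sketched as a host-side C model. This is not the actual handler, which is hand-written assembly. `cpu_model_t`, `fpu_state_t` and `usage_fault_fpu_switch` are hypothetical names, plain struct copies stand in for the VSTM/VLDM of S0-S31, and FPSCR is left out to mirror the steps listed above (a real handler would presumably need to preserve it as well). The bit positions assumed are CPACR[23:20] for CP10/CP11 access, CFSR bit 19 for NOCP and FPCCR bit 0 for LSPACT.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CPACR_CP10_CP11_FULL (0xFu << 20) /* full access to the FPU coprocessors */
#define CFSR_NOCP            (1u << 19)   /* UsageFault: no coprocessor */
#define FPCCR_LSPACT         (1u << 0)

typedef struct { uint32_t s[32]; } fpu_state_t; /* hypothetical per-thread TCB area */

typedef struct {
    uint32_t    cpacr, cfsr, fpccr; /* stand-ins for the memory-mapped registers */
    fpu_state_t regs;               /* stand-in for the live S0-S31 */
} cpu_model_t;

/* Returns 1 if the fault was an FPU access and has been handled, else 0.
 * `owner` is the thread whose state is live in the FPU (may be NULL),
 * `current` is the thread that just faulted. */
static int usage_fault_fpu_switch(cpu_model_t *cpu,
                                  fpu_state_t *owner, fpu_state_t *current)
{
    assert(cpu != NULL && current != NULL);
    if (!(cpu->cfsr & CFSR_NOCP))
        return 0;                        /* some other UsageFault */
    cpu->cfsr &= ~CFSR_NOCP;             /* on hardware: write 1 to clear */
    cpu->cpacr |= CPACR_CP10_CP11_FULL;  /* enable the FPU */

    if (owner != current) {
        cpu->fpccr &= ~FPCCR_LSPACT;     /* 1. keep lazy stacking from firing */
        if (owner != NULL)
            *owner = cpu->regs;          /* 2a. save S0-S31 (VSTM on target) */
        cpu->regs = *current;            /* 2b. restore S0-S31 (VLDM on target) */
        cpu->fpccr |= FPCCR_LSPACT;      /* 3. set LSPACT back to one */
    }
    return 1;
}
```

The only point of the model is the ordering: both bulk transfers happen while LSPACT is zero, and LSPACT is set again before the handler returns.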

Efficiency-wise, this is as efficient as hardware stacking. If I wanted to nitpick, it saves one cycle, as I don't need to write back the incremented pointer.

By the way, I included the first approach even though I didn't end up using it because I think it has some useful info in there, if anyone else comes looking for this.