assembly cpu-architecture apple-m1 arm64 cpu-registers

Performance advantage of 32bit registers in AArch64?

When doing integer operations in AArch64/ARM64, is there a performance difference when using 32bit W{n} registers versus 64bit X{n} registers?

For example, is add W1, W2, W3 any faster than add X1, X2, X3? Is sdiv W1, W2, W3 faster than sdiv X1, X2, X3? Could it be different depending on implementation (like Apple M1/M2/M3 vs. a 64bit Qualcomm Snapdragon)?

My intuition is that there is a minor performance advantage when using W{n}, but I'm not sure whether it actually matters except in tight loops. I'm interested in official Arm documentation on this, if any exists. In the assembly code I'm currently writing, I mostly use X{n} for consistency, but I'm wondering whether it's worth switching to W{n} when I know/expect the data to fit into 32 bits.

Solution

The links provided by @PeterCordes and comment by @NateEldredge have sent me down some interesting rabbit holes.

tl;dr: For arithmetic instructions like ADD, SUB, LSL and so on, there is no performance difference between W{n} and X{n}. However, there is a slight W{n} advantage for udiv/sdiv. Depending on the implementation (Cortex vs. M1), the addressing mode used with ldp and stp can make a tiny difference.

Sources:

Arm Cortex-A77 Core Software Optimization Guide
List of M1 instructions and their cycle times and Firestorm Pipeline Overview by Dougall Johnson
M1 Explainer

Cortex-A77

I suspect the following applies to most AArch64 implementations.

Advantage when using W{n} over X{n}:
- udiv, sdiv: Execution latency of 5 to 12 cycles for W-form, 5 to 20 cycles for X-form.
- There is no difference for multiply accumulate (madd, msub) and thus no difference for "pure" multiply.
Penalty when using W{n} over X{n}:
- ldp, ldnp with signed immediate offset: Throughput of 2 for W-form, but just 1 for X-form.
- stp, stnp with signed immediate offset: Throughput of 2 for W-form, but just 1 for X-form.
- Interestingly, there's no difference when using load/store pair with pre- or post-index. So ldp W0, W1, [SP, #-16] (no exclamation mark) has a penalty, but ldp W0, W1, [SP, #-16]! and ldp W0, W1, [SP], #16 do not!

Apple M1

The implementation of AArch64 by Apple significantly differs from Cortex versions. Here's what I was able to find out:

With load/store pair, the picture is reversed when comparing with Cortex!
- ldnp and stnp have no difference when using W-form or X-form.
- ldp and stp are very slightly faster in W-form for pre- and post-index, but for the signed-offset case they're the same speed.
- ldr and str are also very slightly faster in W-form. The difference seems to be even smaller than with ldp/stp.
Advantage when using W{n}:
- udiv, sdiv: Execution latency of 7 to 8 cycles for W-form, 7 to 9 cycles for X-form.
Penalty when using W{n}:
- mov W0, W1 has to go through an execution unit, whereas mov X0, X1 is handled purely by register renaming internally.
- mov from/to SP is slightly slower with W{n} registers.

Conclusion

As far as I can tell, one of the few cases where W-form vs. X-form might matter is code doing lots of udiv/sdiv on Cortex. On M1, the difference is tiny. Overall, the differences are small where they exist at all, and I suspect they simply don't matter much in real-life code.

The other scenario that occurs to me where the difference might matter is implementing a memcpy with unrolled ldp/stp: on Cortex, using signed-offset forms plus just one pre- or post-indexed instruction could be slightly faster, while on M1 it's probably better to use pre- or post-indexed forms for all ldp/stp.