Tags: assembly, arm, cpu-architecture, micro-optimization

Is it more efficient to touch fewer registers in ARM assembly?


I've just started learning assembly via Raspbian and have a quick question: how much does using fewer registers matter for efficiency? For example, if I wanted to do a quick addition, is there a meaningful difference between

mov r1, #5
mov r2, #3
add r1, r1, r2

and

mov r1, #5
mov r2, #3
add r3, r1, r2     @ destination in a new register that wasn't previously used

(other than the result ending up in a different register)?


Solution

  • Using the same register for output as input has no inherent disadvantage on ARM¹. I don't think there's any inherent advantage either, though. Things can get more interesting in the general case when we're talking about writing registers that the instruction didn't already have to wait for (i.e. not inputs).

    Use as many registers as you need to save instructions. (Be aware of the calling convention, though: if you use more than r0..r3, you'll have to save/restore the extra ones you use if you want to call your function from C.) As a rule, optimize for the lowest dynamic instruction count; doing a bit of extra setup/cleanup to save instructions inside loops is normally worth it.
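
    For instance, a minimal sketch of the save/restore pattern under the AAPCS (the function name and constant here are made up for illustration):

    @ Hypothetical C-callable function that clobbers r4 (callee-saved).
    my_func:
        push  {r4, lr}        @ save the callee-saved register we use, plus the return address
        mov   r4, #5
        add   r0, r0, r4      @ r0 is both the first argument and the return value
        pop   {r4, pc}        @ restore r4 and return in one instruction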

    And not just to save instructions: software pipelining to hide load latency is potentially valuable on pipelined in-order execution CPUs. e.g. if you're looping over an array, load the value you'll need 2 iterations from now into a register, and don't touch it until then. (Unroll the loop). An in-order CPU can only start instructions in order, but they can potentially complete out of order. e.g. a load that misses in cache doesn't stall the CPU until you try to read it when it's not ready. I think you can assume that high-performance in-order CPUs like modern ARMs will have whatever scoreboarding is necessary to track which registers are waiting for an ALU or load result to be ready.
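
    A sketch of that idea, summing a word array (hypothetical code; r0 = array pointer, r1 = accumulator, r2 = loop trip count, with the array holding 2*r2+1 words so the epilogue consumes the final load):

    @ Unrolled by 2 so each value is loaded well before the instruction that adds it.
        ldr   r3, [r0], #4     @ prologue: start the first load early
    loop:
        ldr   r4, [r0], #4     @ start the next load...
        add   r1, r1, r3       @ ...while adding the value loaded last time
        ldr   r3, [r0], #4
        add   r1, r1, r4
        subs  r2, r2, #1
        bne   loop
        add   r1, r1, r3       @ epilogue: consume the final load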

    Without actually going full software-pipelining, you can sometimes get similar results by doing a block of loads, then some computation, then a block of stores. e.g. a memcpy optimized for big copies might load 12 registers in its main unrolled loop, then store those 12 registers. So the distance between a load and store of the same register is still large enough to hide L1 cache load latency at least.
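
    A sketch of that pattern, shown with 8 registers per block to keep the example small (assumes r0 = dst, r1 = src, r2 = byte count, a nonzero multiple of 32; cleanup code omitted):

    @ All 8 loads issue before any of the 8 stores, so each register's
    @ load has many cycles to complete before its store needs the data.
    copy_loop:
        ldmia r1!, {r3-r10}    @ block of loads; ! advances r1 past the block
        subs  r2, r2, #32
        stmia r0!, {r3-r10}    @ block of stores
        bne   copy_loop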


    Current(?) Raspberry Pi boards (RPi 3+) use ARM Cortex-A53 cores, a 2-wide superscalar in-order microarchitecture.

    Any ARM core (like Cortex-A57) that does out-of-order execution will use register renaming to make WAW (write-after-write) and WAR hazards a non-issue. (https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Data_hazards).

    On an in-order core like A53, WAR is definitely a non-issue: there's no way a later instruction can write a register before an earlier instruction has a chance to read its operand from there.
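
    For example, in this hypothetical pair, the mov's write of r1 can't beat the add's read of r1:

    add   r3, r1, r2       @ reads r1 when it issues
    mov   r1, #0           @ WAR on r1: harmless, since in-order issue means
                           @ the add has already picked up the old value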

    But a WAW hazard could limit the ability of the CPU to run two instructions at once. This would only be relevant when writing a register you didn't already read. add r1, r1, r2 has to wait for r1 to be ready before it can even start executing, because it's an input.

    For example, if you had this code, we might actually see a negative performance effect from writing the same output register in 2 instructions that might run in the same cycle. I don't know how Cortex-A53 or any other in-order ARM handles this, but another dual-issue in-order CPU (Intel P5 Pentium from 1993) doesn't pair instructions that write to the same register (Agner Fog's x86 uarch guide). The 2nd one has to wait a cycle before starting (but can maybe pair with the instruction after that).

    @ possible WAW hazard
    adds  r3, r1, r2      @ set flags, we don't care about the r3 output
    add   r3, r1, #5      @ now actually calculate an integer result we want
    

    If you'd used a different dummy output register, these could both start in the same clock cycle. (Or if you'd used cmn r1, r2 (compare negative), you could have set flags from r1 - (-r2) without writing an output at all, which according to the manual is the same as setting flags from r1 + r2.) But probably there's some case you can come up with that couldn't be replaced with a cmp, cmn, tst (ANDS), or teq (EORS) instruction.
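
    The cmn version of the earlier sequence would look like this:

    @ No WAW hazard: cmn sets flags from r1 + r2 without writing a
    @ general-purpose register, so nothing contends for r3's write.
    cmn   r1, r2
    add   r3, r1, #5       @ could potentially issue in the same cycle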

    I'd expect that an out-of-order ARM could rename the same register multiple times in the same cycle (OoO x86 CPUs can do that) to fully avoid WAW hazards.


    I'm not aware of any microarchitectural benefit to leaving some registers "cold".

    On a CPU with register renaming, that's normally done with a physical register file, and even a not-recently-modified architectural register (like r3) will need a PRF entry to hold the value of whatever instruction last wrote it, no matter how long ago that was. So writing a register always allocates a new physical register and (eventually) frees up the physical register holding the old value, regardless of whether the old value was also just written or had been sitting there for a long time.

    Intel P6-family did use a "retirement register file" that holds the retirement state separately from "live" values in the out-of-order back-end. But it kept those live register values right in the ROB with the uop that produced them (instead of a reference to a PRF entry), so it couldn't run out of physical registers for renaming before the back-end was full. See http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for some interesting experiments measuring ROB vs. PRF limits on out-of-order window size, on x86 CPUs that do use a PRF.

    In fact, due to limited read ports on the retirement register file, P6-family (PPro through Nehalem) can actually stall when reading too many registers that haven't been written recently, in one issue group. (See Agner Fog's microarch guide, register read stalls.) But I don't think this is a typical problem on other uarches, like any out-of-order ARM cores. Set up constants / loop invariants in registers outside loops, and freely use them inside.
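
    A sketch of that hoisting pattern (hypothetical loop; the mask value is just for illustration):

    @ Loop invariants set up once, then read freely every iteration.
        mov   r4, #255         @ invariant mask
        mov   r5, #0           @ accumulator
    loop:
        ldr   r3, [r0], #4
        and   r3, r3, r4       @ reusing r4 costs nothing per iteration
        add   r5, r5, r3
        subs  r2, r2, #1
        bne   loop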


    Footnote 1: this is generally true across all architectures, but there are exceptions. The only one I know of is a pretty special case: on recent Intel x86 CPUs (in 64-bit mode), mov eax, eax (1 cycle latency) is slower than mov ecx, eax (0 cycle latency) for truncating a 64-bit register to 32 bits, because mov-elimination only works between different registers. (Can x86's MOV really be "free"? Why can't I reproduce this at all?)