ARM assembly for RPi 3 B+: why XOR a register with itself for a counter?

I was trying to write a program to blink an LED on an RPi 3 B+ in ARMv7 assembly and noticed that it wasn't working with this code for the delay function:

delay:
    b loop

loop:
    add r10, r10, #1
    cmp r10, r4
    bne loop
    beq return

return:
    mov r10, #0
    bx lr

r10 is the register used for the counter, and r4 contains the value that r10 needs to reach before the function stops and returns to the main code. After looking at a tutorial, I found that it XORs the counter register with itself. I added that correction, and now the code looks like this:

delay:
    eor r10, r10, r10
    b loop

loop:
    add r10, r10, #1
    cmp r10, r4
    bne loop
    beq return

return:
    mov r10, #0
    bx lr

I assembled it, loaded it onto the RPi 3, and now it works. But why did I have to add that line? I know what an XOR gate does, but if the two inputs are equal, it'll return the exact same value. What is the point of this operation?

Solution

TL;DR: XOR same,same is like sub same,same: both produce zero.

This tutorial is not good, and neither is XOR-zeroing on ARM, or any RISC ISA. Only use it in x86 asm (and 8080), not in asm for other ISAs, and not in high-level languages.

but if the two inputs are equal, It'll return the exact same value.

No, that would be regular non-exclusive OR. XOR gives you the bits that were different. When both inputs are the same, the result is 0.
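To make the difference concrete, here is a minimal sketch in ARM (GNU assembler) syntax. The register choices and the constant 0xA are arbitrary and used only for illustration:

```asm
    mov r0, #0xA        @ r0 = 0b1010 (arbitrary example value)
    orr r1, r0, r0      @ OR:  1010 | 1010 = 1010, so r1 ends up equal to r0
    eor r2, r0, r0      @ XOR: 1010 ^ 1010 = 0000, so r2 = 0 whatever r0 held
```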

XOR-zeroing is good only on x86. (See "What is the best way to set a register to zero in x86 assembly: xor, mov or and?" for details on why.) None of those reasons apply on ARM: mov reg, #0 is the same size in machine code as eor reg,reg,reg, so there was never a historical reason to support EOR as a "zeroing idiom" that's special-cased by modern CPUs.

(This is true even in Thumb code, although in that case you want movs reg, #0 for the smaller encoding, at least with r0-r7. r8-r14 need a 4-byte Thumb2 encoding regardless of setting flags or not.)

In fact an ARM CPU isn't even architecturally allowed to optimize eor dst, same,same to break the false dependency, because memory dependency-ordering rules require EOR and other operations to carry a dependency. (e.g. for using the result of a std::memory_order_consume load.) Not that they'd bother spending transistors and power on it, since there's no reason for ARM machine code to use that in the first place when mov reg, #0 works perfectly well.

So eor r10, r10, r10 is clearly worse than mov r10, #0.

Never use it unless you want a 0 that has a dependency on the old value of R10. If you don't know what that means, you don't want it; it would only be useful in multithreaded code on a load result like a data_ready flag, or in microbenchmark experiments to test out-of-order scheduling, or latency vs. throughput by generating a constant value with a data dependency on some result.
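For illustration only, here is a sketch of the kind of sequence where a zero that depends on an earlier load matters. The registers and the memory they point to are hypothetical, and this is a fragment rather than a complete program:

```asm
    ldr r0, [r1]        @ load a flag (r1 is assumed to point at something like data_ready)
    eor r2, r0, r0      @ r2 = 0, but it carries a data dependency on the loaded r0
    ldr r3, [r4, r2]    @ address is r4 + 0, yet it depends on the first load, so under
                        @ the dependency-ordering rules this load is ordered after it
```

With mov r2, #0 in place of the eor, the second load would have no dependency on the first. That is exactly what you want for an ordinary loop counter.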

On x86 it saved a byte of machine-code size vs. mov ax, 0, and 3 bytes in 32-bit mode, so real world code used it everywhere. Later CPUs evolved to make it still efficient even with out-of-order execution, where reading the old value of the register as an input would otherwise be a problem. (Unlike with mov reg, 0 which we expect not to have a false dependency even without any special support. mov is always dependency-breaking; the special casing of xor same,same on x86 merely makes it equal in that way. xor-zeroing is better in other ways on x86.)

This "tutorial" was clearly written as a learning exercise by another beginner (which is common for random tutorials you find on the Internet; it's a lot of work to write a good one).

It's not an example of good, efficient code, given that bug (failing to zero the loop counter) and the two useless b next_instruction instructions. Execution falls through to the next instruction anyway, even without the b loop or beq return.

Most conditional branches should just be a compare and one branch, with the other path of execution being the fall-through. It's somewhat of an anti-pattern in beginner code to put two branches with opposite conditions one after the other, or to write the bottom of a loop as while(1) { if(cond) break; } instead of simply do{}while(cond). In your loop, at least, the useless branch is outside the loop. But it's a delay loop that exists only to waste time anyway, so really it's just wasting code size and changing the cycles-per-count delay factor.

If you need execution to go somewhere else in both cases (i.e. both possible targets come after other code that should fall through into them), then the second branch should be an unconditional b. And you should never write a branch that jumps to the next instruction in source order, because execution would go there anyway even if there were no branch.
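Putting these points together, one possible cleaned-up version of the delay function from the question might look like this. It is a sketch that keeps the question's register usage (r10 as the counter, r4 as the limit) and assumes r4 is nonzero:

```asm
delay:
    mov r10, #0         @ zero the counter on every call; no dependency on the old r10
loop:                   @ execution falls through from delay into the loop
    add r10, r10, #1
    cmp r10, r4
    bne loop            @ the only branch: go around again while r10 != r4
    bx  lr              @ fall-through path: the counter reached r4, so return
```

Because the counter is zeroed on entry, the mov r10, #0 before bx lr is no longer needed. The one visible difference is that r10 equals r4 on return instead of 0. If r4 were 0, the counter would have to wrap all the way around before matching, so callers should pass a nonzero count.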