I'm studying x86-64 NASM, and here is my current situation:
At first, I wrote fairly straightforward, easy-to-read code. However, I then found some "clever" ways to initialize registers that reduce instruction length.
I want to know whether these clever tricks bring any real benefit, or whether they do more harm than good.
This is the first version, written the straightforward way:
```nasm
.loop:
    mov rax, -1
    mov rdx, 1 ; **
    mov rsi, 2 ; **
    ; ... loop body
    dec rcx
    jnz .loop
```
(**: The assembler actually emitted these lines as `mov edx, 1` and `mov esi, 2`. I later found out that the assembler optimized them for me, because writing to EDX/ESI zeroes the upper 32 bits of RDX/RSI.)
That is 17 bytes for the initialization and 5 bytes for the loop end.
This is the second version, written the clever way:
```nasm
.loop:
    xor eax, eax
    dec rax
    lea edx, [rax+2] ; ***
    lea esi, [rdx+1] ; ***
    ; ... loop body
    loop .loop
```
(***: I tried various combinations of 32-bit and 64-bit registers, and these gave the shortest encodings.)
That is 11 bytes for the initialization and 2 bytes for the loop end.
Whether it's a good idea to do this or not depends on your objective. Usually, it is not a good idea.
If your objective is ease of understanding, you should avoid these tricks as they make your code harder to understand.
If your objective is code size reduction, it might indeed be a good idea to make use of such tricks. You can do even better than you already did, though: for example, `or rax, -1` sets `rax` to `-1` in only 4 bytes, and `push -1` followed by `pop rax` does it in only 3 bytes.
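As a rough sketch of that size comparison, the listing below puts the alternatives for setting `rax` to -1 side by side. The byte values in the comments are the standard encodings I would expect NASM to emit with its default optimization settings; the file name `sizes.asm` is hypothetical, and the listing output lets you check the sizes yourself.

```nasm
; sizes.asm (hypothetical file name): four ways to set rax to -1.
; Assemble with a listing file to inspect the encodings:
;   nasm -f elf64 -l sizes.lst sizes.asm
        bits 64
        section .text

        mov     rax, -1         ; 48 C7 C0 FF FF FF FF   7 bytes

        xor     eax, eax        ; 31 C0                  2 bytes
        dec     rax             ; 48 FF C8               3 bytes (5 total)

        or      rax, -1         ; 48 83 C8 FF            4 bytes

        push    -1              ; 6A FF                  2 bytes
        pop     rax             ; 58                     1 byte  (3 total)
```

Keep in mind that the two shortest forms trade speed for size: `or rax, -1` has to wait for the previous value of `rax`, and the `push`/`pop` pair goes through the stack.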
However, usually the objective is performance. Now when you optimise for performance, some tricks help, but others are detrimental. In particular, all the tricks you showed us in your question are detrimental to performance:
- Clearing and then decrementing a register takes just as long as, or is a bit slower than, setting the register to `-1` directly, depending on the microarchitecture. I would avoid it anyway, as two instructions take up more decoder bandwidth than one instruction.
- Deriving registers from other registers rather than setting them directly does not take more time per se, but as you introduce a dependency on the other register, these initialisations must now be performed after the other register is set rather than in parallel. This can reduce performance on out-of-order architectures and should be avoided, although sometimes it may still be beneficial. Design your code such that as many operations as possible can be done in parallel.
- The `loop` instruction is well known to be a slow one and should be avoided. But so should `dec` followed by a conditional branch: as `dec` performs a partial flag update, a penalty exists on some microarchitectures if the flags are read subsequently. Use `sub rcx, 1` instead if you want to evaluate the flag result.
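Putting these three points together, a performance-oriented version of the loop from the question might look like the sketch below. The label `demo` and the iteration count are placeholders made up for illustration, and whether this is actually faster on a given machine is something only a benchmark can tell.

```nasm
; Sketch only: `demo` and the iteration count are hypothetical.
        bits 64
        section .text
        global  demo
demo:
        mov     ecx, 1000       ; hypothetical iteration count (zero-extends into rcx)
.loop:
        mov     rax, -1         ; single instruction instead of xor + dec
        mov     edx, 1          ; immediate load, does not depend on rax
        mov     esi, 2          ; immediate load, does not depend on rdx
        ; ... loop body
        sub     rcx, 1          ; writes all arithmetic flags, unlike dec
        jnz     .loop           ; ordinary conditional branch instead of loop
        ret
```

The three initialisations have no dependencies on one another, so an out-of-order core is free to execute them in parallel. The cost is code size: `sub rcx, 1` is one byte longer than `dec rcx`, and the immediate loads are longer than the `lea` chain.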
Note that when optimising for performance, occasionally it might still be a good idea to optimise for size. This is because longer code sequences take more space in the instruction cache, blocking other code from being cached. In big programs whose hot code paths do not entirely fit into L1 instruction cache, performance can benefit from code size optimisations, especially in cold paths that are rarely executed. However, this is a tricky thing to evaluate and strategies must be adapted to the case at hand. Let benchmarks guide your decisions in any case.