assembly x86-64 nasm micro-optimization

Is it "too clever" to use LEA to load a constant into a register?


I'm studying x86-64 NASM, and here is my current situation:

  • This code is for education only, not for running on a client-facing system or the like.
  • RCX holds the loop count, between 1 and 1000.
  • At the beginning of each loop iteration, I initialize RAX = -1, RDX = 1, and RSI = 2.
  • The loop body contains ~30 instructions.

At first, I wrote fairly straightforward, easy-to-read code. Then I found some "clever" ways to initialize the registers that reduce the encoded size of the instructions.

I want to know whether these clever tricks bring any real benefit, or do more harm than good.

This is the first version, written the straightforward way:

.loop:
    mov rax, -1
    mov rdx, 1 ; **
    mov rsi, 2 ; **

    ; ... loop body

    dec rcx
    jnz .loop

(**: The assembler actually emitted these lines as mov edx, 1 and mov esi, 2. I later found that it optimized them for me, because writing EDX/ESI zeroes the upper 32 bits of RDX/RSI.)

That is 17 bytes at the beginning of the loop and 5 bytes at the end.

This is the second version, written the clever way:

.loop:
    xor eax, eax
    dec rax
    lea edx, [rax+2] ; ***
    lea esi, [rdx+1] ; ***

    ; ... loop body

    loop .loop

(***: I tried various combinations of 32-bit / 64-bit registers, and these had the shortest encodings.)

That is 11 bytes at the beginning of the loop and 2 bytes at the end.


Solution

  • Whether it's a good idea to do this or not depends on your objective. Usually, it is not a good idea.

    If your objective is ease of understanding, you should avoid these tricks as they make your code harder to understand.

    If your objective is code size reduction, it might indeed be a good idea to make use of such tricks. You can do even better than you already did though; for example, you could do or rax, -1 to set rax to -1 with only 4 bytes. Or push -1 followed by pop rax for only 3 bytes.
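
    As a rough sketch of the size-oriented route, the push/pop trick can be combined with the two lea instructions from the question, which brings the initialisation down from 11 to 9 bytes. Everything around those four instructions (the _start entry point, the self-check, the Linux exit system call and the build command in the first comment) is an assumption added only so that the fragment can be assembled and run on its own.

    ```nasm
    ; Assumed build (Linux x86-64): nasm -f elf64 size.asm && ld -o size size.o
    ; Exit status 0 means all three registers hold the expected values.
    global _start

    section .text
    _start:
        push -1             ; 6A FF     (2 bytes) pushes a sign-extended 64-bit -1
        pop  rax            ; 58        (1 byte)  RAX = -1
        lea  edx, [rax+2]   ; 8D 50 02  (3 bytes) EDX = 1, upper half of RDX zeroed
        lea  esi, [rdx+1]   ; 8D 72 01  (3 bytes) ESI = 2, upper half of RSI zeroed
        ; Alternative for RAX: or rax, -1 (48 83 C8 FF, 4 bytes). It needs no
        ; stack access, but it depends on the previous value of RAX.

        ; Self-check (illustration only): EDI is the exit status, 1 on mismatch.
        mov  edi, 1
        cmp  rax, -1
        jne  .exit
        cmp  rdx, 1
        jne  .exit
        cmp  rsi, 2
        jne  .exit
        xor  edi, edi       ; all values as expected: status 0
    .exit:
        mov  eax, 60        ; Linux x86-64 system call number for exit
        syscall
    ```

    Note that the push/pop pair goes through stack memory and that both lea instructions form a chain starting at RAX, so this variant trades speed for size.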

    However, usually the objective is performance. Now when you optimise for performance, some tricks help, but others are detrimental. In particular, all the tricks you showed us in your question are detrimental to performance:

    • clearing and then decrementing a register is at best as fast as setting the register to -1 directly, and slightly slower on some microarchitectures. I would avoid it anyway, as two instructions take up more decoder bandwidth than one.

    • deriving registers from other registers rather than setting them directly does not take more time per se, but as you introduce a dependency on the other register, these initialisations must now be performed after the other register is set rather than in parallel. This can reduce performance on out-of-order architectures and should be avoided, but sometimes it may still be beneficial. Design your code such that as many operations as possible can be done in parallel.

    • the loop instruction is well-known to be a slow one and should be avoided. But so should dec followed by a conditional branch: as dec performs a partial flag update (it leaves the carry flag unchanged), a penalty exists on some microarchitectures if the flags are read subsequently. Use sub rcx, 1 instead if you want to evaluate the flag result.
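
    Putting the three points above together, a performance-oriented version of the loop might look like the following sketch. The three initialisations are independent immediate loads, and the loop is closed with sub and jnz. The loop body here is a hypothetical placeholder (it just sums the three registers into RDI so the result can be checked), and the _start entry point, the exit system call and the build command are assumptions for a Linux x86-64 system rather than anything taken from the question.

    ```nasm
    ; Assumed build (Linux x86-64): nasm -f elf64 perf.asm && ld -o perf perf.o
    ; Exit status 0 means the placeholder body ran 1000 times with the expected values.
    global _start

    section .text
    _start:
        mov  ecx, 1000      ; loop count (the question allows 1 to 1000)
        xor  edi, edi       ; hypothetical accumulator for the placeholder body
    .loop:
        ; Three independent immediate loads: none waits for another's result.
        mov  rax, -1
        mov  edx, 1         ; writing EDX zeroes the upper half of RDX
        mov  esi, 2         ; writing ESI zeroes the upper half of RSI

        ; Placeholder body: RDI += RAX + RDX + RSI, i.e. +2 per iteration.
        add  rdi, rax
        add  rdi, rdx
        add  rdi, rsi

        sub  rcx, 1         ; unlike dec, sub writes all arithmetic flags, CF included
        jnz  .loop

        ; RDI should now be 2 * 1000 = 2000.
        cmp  rdi, 2000
        mov  edi, 0         ; mov leaves the flags from cmp untouched
        jz   .exit
        mov  edi, 1         ; mismatch: exit status 1
    .exit:
        mov  eax, 60        ; Linux x86-64 system call number for exit
        syscall
    ```

    This costs the same 17 bytes of initialisation as the first version in the question, plus one extra byte for sub rcx, 1 compared with dec rcx. Whether that is worth it on a given CPU is something only a benchmark can tell.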

    Note that when optimising for performance, occasionally it might still be a good idea to optimise for size. This is because longer code sequences take more space in the instruction cache, blocking other code from being cached. In big programs whose hot code paths do not entirely fit into L1 instruction cache, performance can benefit from code size optimisations, especially in cold paths that are rarely executed. However, this is a tricky thing to evaluate and strategies must be adapted to the case at hand. Let benchmarks guide your decisions in any case.