c++ assembly visual-c++ x86-64 micro-optimization

Advantage of using LEA over MOV for passing parameters in Assembly compiled from C++

I am experimenting with the way parameters are passed to a function when compiling C++ code. I tried to compile the following C++ code using the x64 msvc 19.35/latest compiler to see the resulting assembly:

#include <cstdint>

void f(std::uint32_t, std::uint32_t, std::uint32_t, std::uint32_t);

void test()
{
    f(1, 2, 3, 4);
}

and got this result:

void test(void) PROC
        mov     edx, 2
        lea     r9d, QWORD PTR [rdx+2]
        lea     r8d, QWORD PTR [rdx+1]
        lea     ecx, QWORD PTR [rdx-1]
        jmp     void f(unsigned int,unsigned int,unsigned int,unsigned int)
void test(void) ENDP

Result on godbolt.org

What I do not understand is why the compiler chose to use lea instead of a simple mov for this example. I understand the mechanics of lea and how it results in the correct values in each register, but I would have expected something more straightforward like:

void test(void) PROC
        mov     ecx, 1
        mov     edx, 2
        mov     r8d, 3
        mov     r9d, 4
        jmp     void f(unsigned int,unsigned int,unsigned int,unsigned int)
void test(void) ENDP

Moreover, from my limited understanding of how modern CPUs work, I have the feeling that the version using lea would be slower, since each lea reads rdx and therefore depends on the result of the preceding mov edx, 2.

clang and gcc both give the result I expect, i.e., 4x mov.

Solution

MSVC's code is smaller than the naive mov approach: 16 bytes versus 22, as the assembler listing below shows. (But as you point out, the dependency of each lea on the mov may make it slower; you would have to measure that.)

                          bits 64
00000000 BA02000000       mov     edx, 2
00000005 448D4A02         lea     r9d, QWORD [rdx+2]
00000009 448D4201         lea     r8d, QWORD [rdx+1]
0000000D 8D4AFF           lea     ecx, QWORD [rdx-1]

00000010 B901000000       mov     ecx, 1
00000015 BA02000000       mov     edx, 2
0000001A 41B803000000     mov     r8d, 3
00000020 41B904000000     mov     r9d, 4

mov ecx, 1 is 5 bytes: one byte for the opcode B8-BF which also encodes the register, and 4 bytes for the 32-bit immediate. In particular, unlike for some arithmetic instructions, there is no option for mov to encode a smaller immediate with fewer bytes using zero- or sign-extension.
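To make the byte count concrete, here is a minimal C++ sketch that assembles the `mov r32, imm32` form by hand. The names `MovEncoding` and `encode_mov_r32_imm32` are hypothetical helpers invented for this illustration, and the sketch covers only this one instruction form; it is not how any real assembler is structured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical container for one encoded instruction (at most 6 bytes here).
struct MovEncoding {
    std::uint8_t bytes[6] = {};
    std::size_t size = 0;
};

// Sketch: encode `mov r32, imm32` (opcode B8+rd, followed by a full imm32).
// `reg` is the hardware register number: 0=eax, 1=ecx, 2=edx, 3=ebx, ...,
// 8=r8d, 9=r9d, up to 15=r15d.
MovEncoding encode_mov_r32_imm32(unsigned reg, std::uint32_t imm) {
    assert(reg < 16);
    MovEncoding e;
    if (reg >= 8) {
        e.bytes[e.size++] = 0x41;  // REX.B: selects r8d-r15d
    }
    // The opcode byte carries the low 3 bits of the register number.
    e.bytes[e.size++] = static_cast<std::uint8_t>(0xB8 + (reg & 7));
    // The immediate is always 4 bytes, little-endian, however small the value.
    for (int i = 0; i < 4; ++i) {
        e.bytes[e.size++] = static_cast<std::uint8_t>(imm >> (8 * i));
    }
    return e;
}
```

Calling `encode_mov_r32_imm32(1, 1)` yields the five bytes `B9 01 00 00 00` from the listing, and `encode_mov_r32_imm32(8, 3)` yields the six bytes `41 B8 03 00 00 00`. The loop always emits four immediate bytes, which is exactly the cost the lea trick avoids.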

lea ecx, [rdx-1] is 3 bytes: one byte for the opcode (8D); one ModR/M byte, which encodes the destination register ecx and the base register rdx for the effective address of the memory operand; and (here is the key) one byte for an 8-bit sign-extended displacement.
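The same breakdown can be sketched in C++ by building the lea encoding byte by byte. `LeaEncoding` and `encode_lea_r32_base_disp8` are hypothetical names for this illustration. The sketch assumes a 32-bit destination, a 64-bit base register, and a displacement that fits in 8 bits. It deliberately rejects rsp and r12 as the base, because those need an extra SIB byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical container for one encoded instruction (at most 4 bytes here).
struct LeaEncoding {
    std::uint8_t bytes[4] = {};
    std::size_t size = 0;
};

// Sketch: encode `lea r32, [base64 + disp8]`.
// Register numbers: 0=eax/rax, 1=ecx/rcx, 2=edx/rdx, ..., 8=r8d/r8, 9=r9d/r9.
LeaEncoding encode_lea_r32_base_disp8(unsigned dst, unsigned base, std::int8_t disp) {
    assert(dst < 16 && base < 16);
    assert((base & 7) != 4);  // rsp/r12 as base would require a SIB byte
    LeaEncoding e;
    // REX is 0100WRXB: R extends the ModR/M reg field (dst), B extends r/m (base).
    const unsigned rex = 0x40u | (dst >= 8 ? 0x04u : 0u) | (base >= 8 ? 0x01u : 0u);
    if (rex != 0x40u) {
        e.bytes[e.size++] = static_cast<std::uint8_t>(rex);
    }
    e.bytes[e.size++] = 0x8D;  // LEA opcode
    // ModR/M: mod=01 (base + disp8), reg=dst, r/m=base.
    e.bytes[e.size++] = static_cast<std::uint8_t>(0x40u | ((dst & 7) << 3) | (base & 7));
    e.bytes[e.size++] = static_cast<std::uint8_t>(disp);  // sign-extended by the CPU
    return e;
}
```

With dst=1 (ecx), base=2 (rdx) and disp=-1 this produces `8D 4A FF`, and with dst=9 (r9d) and disp=2 it produces `44 8D 4A 02`, both matching the listing. The single displacement byte stands in for the 4-byte immediate that mov would need.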

The instructions using r8d and r9d need one extra byte for a REX prefix, but that's true for both mov and lea, so it's a wash.
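As a rough check of that accounting, the following C++ sketch adds up instruction sizes from their components. The types and function names (`Form`, `Instr`, `instr_size`, and so on) are made up for this illustration, and only the two instruction forms discussed here are modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>

// The two instruction forms compared in this answer.
enum class Form { MovImm32, LeaDisp8 };

struct Instr {
    Form form;
    bool needs_rex;  // true when the destination is r8d-r15d
};

std::size_t instr_size(const Instr& in) {
    const std::size_t rex = in.needs_rex ? 1 : 0;
    switch (in.form) {
        case Form::MovImm32: return rex + 1 + 4;      // opcode + imm32
        case Form::LeaDisp8: return rex + 1 + 1 + 1;  // opcode + ModR/M + disp8
    }
    assert(false && "unknown instruction form");
    return 0;
}

std::size_t sequence_size(std::initializer_list<Instr> seq) {
    std::size_t total = 0;
    for (const Instr& in : seq) total += instr_size(in);
    return total;
}

// mov edx; lea r9d; lea r8d; lea ecx
std::size_t msvc_sequence_size() {
    return sequence_size({{Form::MovImm32, false}, {Form::LeaDisp8, true},
                          {Form::LeaDisp8, true}, {Form::LeaDisp8, false}});
}

// mov ecx; mov edx; mov r8d; mov r9d
std::size_t all_mov_sequence_size() {
    return sequence_size({{Form::MovImm32, false}, {Form::MovImm32, false},
                          {Form::MovImm32, true}, {Form::MovImm32, true}});
}
```

Each lea is 2 bytes shorter than the mov it replaces, with or without REX, so the three replaced instructions account for the 6-byte difference between the 22-byte and 16-byte sequences.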