Can someone explain why we move the value in rax to rdi in the main function at 0x6f5, then copy the value in rdi to the stack of get_v, and then move it back to rax at 0x6c8? Perhaps it is a convention of x86-64, but I don't understand the logic behind it.
main:
0x00000000000006da <+0>: push rbp
0x00000000000006db <+1>: mov rbp,rsp
0x00000000000006de <+4>: sub rsp,0x10
0x00000000000006e2 <+8>: mov rax,QWORD PTR fs:0x28
0x00000000000006eb <+17>: mov QWORD PTR [rbp-0x8],rax
0x00000000000006ef <+21>: xor eax,eax
0x00000000000006f1 <+23>: lea rax,[rbp-0xc]
=>0x00000000000006f5 <+27>: mov rdi,rax
0x00000000000006f8 <+30>: call 0x6c0 <get_v>
0x00000000000006fd <+35>: mov eax,0x0
0x0000000000000702 <+40>: mov rdx,QWORD PTR [rbp-0x8]
0x0000000000000706 <+44>: xor rdx,QWORD PTR fs:0x28
0x000000000000070f <+53>: je 0x716 <main+60>
0x0000000000000711 <+55>: call 0x580
0x0000000000000716 <+60>: leave
0x0000000000000717 <+61>: ret
get_v:
0x00000000000006c0 <+0>: push rbp
0x00000000000006c1 <+1>: mov rbp,rsp
0x00000000000006c4 <+4>: mov QWORD PTR [rbp-0x8],rdi
=>0x00000000000006c8 <+8>: mov rax,QWORD PTR [rbp-0x8]
0x00000000000006cc <+12>: mov DWORD PTR [rax],0x2
0x00000000000006d2 <+18>: mov rax,QWORD PTR [rbp-0x8]
0x00000000000006d6 <+22>: mov eax,DWORD PTR [rax]
0x00000000000006d8 <+24>: pop rbp
0x00000000000006d9 <+25>: ret
This is unoptimized code. There are a lot of instructions here that are redundant and make very little sense, so I'm not sure why you've fixed on the particular indicated one. Consider the instructions immediately preceding it:
xor eax,eax
lea rax,[rbp-0xc]
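First, RAX is cleared (instructions that write the lower 32 bits of a 64-bit register implicitly zero the upper 32 bits, so xor reg32, reg32 has the same effect as xor reg64, reg64 and has a slightly shorter encoding), then RAX is loaded with a value. The xor is not needed to compute the address. It is most likely the tail end of the stack-protector sequence just above it, where GCC zeroes the register so that the canary value loaded from fs:0x28 does not linger in RAX. As far as the lea is concerned, clearing RAX first accomplishes nothing, since the lea overwrites the whole register anyway.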
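The zero-extension rule mentioned above can be modeled in a few lines of C++. This is only an illustrative sketch: the Reg64 type and its write32/write16 helpers are hypothetical names invented here, not anything the CPU or compiler exposes. It mimics how a write to a 32-bit register clears bits 63:32, whereas a write to a 16-bit register leaves the upper bits alone.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a 64-bit general-purpose register such as RAX.
struct Reg64 {
    std::uint64_t bits;

    // Writing the 32-bit alias (e.g. EAX) zero-extends into the full register.
    void write32(std::uint32_t v) { bits = v; }

    // Writing the 16-bit alias (e.g. AX) preserves bits 63:16.
    void write16(std::uint16_t v) {
        bits = (bits & ~std::uint64_t{0xFFFF}) | v;
    }
};

// Models "xor eax, eax": the 32-bit result is written through the EAX alias.
std::uint64_t xor_eax_eax(std::uint64_t old_rax) {
    Reg64 rax{old_rax};
    std::uint32_t eax = static_cast<std::uint32_t>(rax.bits);
    rax.write32(eax ^ eax);
    return rax.bits;
}

// Models "xor ax, ax": only the low 16 bits are replaced.
std::uint64_t xor_ax_ax(std::uint64_t old_rax) {
    Reg64 rax{old_rax};
    std::uint16_t ax = static_cast<std::uint16_t>(rax.bits);
    rax.write16(static_cast<std::uint16_t>(ax ^ ax));
    return rax.bits;
}

// Small self-check of the model.
void demo_zero_extension() {
    assert(xor_eax_eax(0xFFFFFFFFFFFFFFFFull) == 0);
    assert(xor_ax_ax(0xFFFFFFFFFFFFFFFFull) == 0xFFFFFFFFFFFF0000ull);
}
```

In this model, xor_eax_eax clears the whole register no matter what it held before, which is why the 32-bit form is enough to zero RAX.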
In this code:
lea rax,[rbp-0xc]
mov rdi,rax
RAX is loaded, and then its value is copied into RDI. This makes sense if you need the same value in both RAX and RDI, but you don't. The value just needs to be in RDI in preparation for the function call. (The System V AMD64 calling convention passes the first integer or pointer argument in the RDI register.) So this could have simply been:
lea rdi, [rbp-0xc]
but, again, this is unoptimized code. The compiler is prioritizing fast code generation and the ability to set breakpoints on individual (high-level language) statements over the generation of efficient code (which takes longer to produce and is harder to debug).
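For context, the disassembly of main in the question is consistent with C++ source along the following lines. This is a guess, not the asker's actual code: the variable name v and the function name caller (standing in for main) are assumptions. The point is that passing v by reference means passing its address, which is why lea computes rbp-0xc and the result has to end up in RDI.

```cpp
#include <cassert>

// Assumed definition of get_v: stores 2 through the reference and returns it.
int get_v(int& a) {
    a = 0x2;
    return a;
}

// Hypothetical stand-in for the question's main.
int caller() {
    int v;      // a stack local; at [rbp-0xc] in the question's listing
    get_v(v);   // the address of v is the first argument, so it travels in RDI
    return v;   // the question's main returns 0; v is returned here to make the effect observable
}

// Small self-check.
void demo_caller() {
    assert(caller() == 2);
}
```

The fs:0x28 loads and the final xor/je/call sequence in the question's main are stack-protector code rather than part of the call itself. Building with -fno-stack-protector should remove them, although defaults vary between distributions.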
The cyclical spill-reload from the stack in get_v is another symptom of unoptimized code:
mov QWORD PTR [rbp-0x8],rdi
mov rax,QWORD PTR [rbp-0x8]
None of this is required. It's all just busy work, a common calling card of unoptimized code. In an optimized build, or hand-written assembly, it would have been written simply as a register-to-register move, e.g.:
mov rax, rdi
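As for why RDI holds an address in the first place: on this platform, a reference parameter is in practice passed like a pointer. The sketch below uses a hypothetical get_v_ptr to annotate what the -O0 listing does at each step. The mapping in the comments is my reading of the question's disassembly, not compiler documentation.

```cpp
#include <cassert>

// Hypothetical pointer-based equivalent of get_v(int&).
int get_v_ptr(int* a) {   // a arrives in RDI; at -O0 it is first stored to [rbp-0x8]
    *a = 0x2;             // reload a from [rbp-0x8] into RAX, then store 2 through it
    return *a;            // reload a again, then load the int it points to into EAX
}

// Small self-check: the callee writes through the pointer and returns the value.
int demo_get_v_ptr() {
    int v = 0;
    int r = get_v_ptr(&v);
    assert(v == 2);
    return r;
}
```

Each C++ statement reloads a from its stack slot because, at -O0, the compiler keeps every variable in memory between statements so that a debugger can inspect or modify it.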
In unoptimized builds, GCC follows the pattern you've observed every time. Consider this function:
void SetParam(int& a)
{
    a = 0x2;
}
With -O0 (optimizations disabled), GCC emits the following:
SetParam(int&):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], 2
nop
pop rbp
ret
Look familiar?
Now enable optimizations, and we get the more sensible:
SetParam(int&):
mov DWORD PTR [rdi], 2
ret
Here, the store goes directly through the address passed in the RDI register. No stack frame needs to be set up or torn down. In fact, apart from the return address that ret pops, the stack is not touched at all. Not only is the code much simpler and easier to understand, it is also much faster.
Which serves as a lesson: when you are trying to analyze a compiler's object-code output, always enable optimization. Studying unoptimized builds is largely a waste of time, unless you are actually interested in how the compiler generates unoptimized code (e.g., because you're writing or reverse-engineering the compiler itself). Otherwise, what you care about is optimized code because it is simpler to understand and much more real-world.
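To try the comparison yourself, a small translation unit like the one below is enough. The file name demo.cpp and the check function are placeholders, and the commands in the comment assume a GCC toolchain on x86-64 Linux. -S, -O0, -O2 and -masm=intel are standard GCC options, but the exact output depends on the compiler version and distribution defaults.

```cpp
// Build twice and compare the generated assembly (assumed file name: demo.cpp):
//   g++ -O0 -S -masm=intel demo.cpp -o demo_O0.s
//   g++ -O2 -S -masm=intel demo.cpp -o demo_O2.s
#include <cassert>

void SetParam(int& a) {
    a = 0x2;
}

int get_v(int& a) {
    a = 0x2;
    return a;
}

// Placeholder driver so the file also does something observable when run.
int check() {
    int x = 0;
    SetParam(x);
    assert(x == 2);

    int y = 0;
    int r = get_v(y);
    assert(y == 2);
    return r;
}
```

Keeping the functions non-static and taking their inputs as parameters matters here: if the optimizer can see every call site and all the inputs are constants, it may fold the whole computation away and leave nothing interesting to read.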
Your entire get_v function could be simply:
mov DWORD PTR [rdi], 0x2
mov eax, DWORD PTR [rdi]
ret
There's no reason to use the stack, shuffling values back and forth. There's no reason to reload the pointer from the address RBP-8, since we already have that value in RDI.
But actually, we can do even better than this. Since we are storing a constant to the address held in RDI, the value we would load back is already known:
mov DWORD PTR [rdi], 0x2
mov eax, 0x2
ret
In fact, this is exactly what GCC generates for what I imagine is your get_v function:
int get_v(int& a)
{
    a = 0x2;
    return a;
}
Unoptimized:
get_v(int&):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], 2
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
pop rbp
ret
Optimized:
get_v(int&):
mov DWORD PTR [rdi], 2
mov eax, 2
ret