assembly, x86, gdb, x86-64, reverse-engineering

Why copy the same value to rax that it already has?


Can someone explain why the value in rax is moved to rdi in main at 0x6f5, then copied from rdi onto the stack of get_v, and then moved back into rax at 0x6c8? Perhaps it is a convention of x86-64, but I don't understand the logic behind it.

 main:
   0x00000000000006da <+0>:     push   rbp
   0x00000000000006db <+1>:     mov    rbp,rsp
   0x00000000000006de <+4>:     sub    rsp,0x10
   0x00000000000006e2 <+8>:     mov    rax,QWORD PTR fs:0x28
   0x00000000000006eb <+17>:    mov    QWORD PTR [rbp-0x8],rax
   0x00000000000006ef <+21>:    xor    eax,eax
   0x00000000000006f1 <+23>:    lea    rax,[rbp-0xc]
 =>0x00000000000006f5 <+27>:    mov    rdi,rax
   0x00000000000006f8 <+30>:    call   0x6c0 <get_v>
   0x00000000000006fd <+35>:    mov    eax,0x0
   0x0000000000000702 <+40>:    mov    rdx,QWORD PTR [rbp-0x8]
   0x0000000000000706 <+44>:    xor    rdx,QWORD PTR fs:0x28
   0x000000000000070f <+53>:    je     0x716 <main+60>
   0x0000000000000711 <+55>:    call   0x580
   0x0000000000000716 <+60>:    leave  
   0x0000000000000717 <+61>:    ret    

 get_v:
   0x00000000000006c0 <+0>:     push   rbp
   0x00000000000006c1 <+1>:     mov    rbp,rsp
   0x00000000000006c4 <+4>:     mov    QWORD PTR [rbp-0x8],rdi
 =>0x00000000000006c8 <+8>:     mov    rax,QWORD PTR [rbp-0x8]
   0x00000000000006cc <+12>:    mov    DWORD PTR [rax],0x2
   0x00000000000006d2 <+18>:    mov    rax,QWORD PTR [rbp-0x8]
   0x00000000000006d6 <+22>:    mov    eax,DWORD PTR [rax]
   0x00000000000006d8 <+24>:    pop    rbp
   0x00000000000006d9 <+25>:    ret    

Solution

  • This is unoptimized code. There are a lot of redundant instructions here that make very little sense, so I'm not sure why you've singled out that particular one. Consider the instructions immediately preceding it:

    xor    eax,eax
    lea    rax,[rbp-0xc]
    

    First, RAX is cleared (instructions that operate on the lower 32 bits of a 64-bit register implicitly clear the upper bits, so xor reg32, reg32 is equivalent to, and encodes more compactly than, xor reg64, reg64), then RAX is loaded with a value. There was absolutely no reason to clear RAX first, so the first instruction could have been elided altogether.
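
    Incidentally, that implicit clearing of the upper 32 bits is also why compilers never need a separate zero-extension instruction when widening a 32-bit value to 64 bits. A minimal sketch, not taken from your program (the exact output depends on compiler and version, but GCC with optimizations enabled typically behaves this way):

    #include <cstdint>

    // Writing the 32-bit register (EAX) already zeroes bits 32-63 of RAX,
    // so this typically compiles to just:  mov eax, edi  /  ret
    uint64_t widen(uint32_t x)
    {
        return x;
    }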

    In this code:

    lea    rax,[rbp-0xc]
    mov    rdi,rax
    

    RAX is loaded, and then its value is copied into RDI. This makes sense if you need the same value in both RAX and RDI, but you don't. The value just needs to be in RDI in preparation for the function call. (The System V AMD64 calling convention passes the first integer parameter in the RDI register.) So this could have simply been:

    lea   rdi, [rbp-0xc]
    

    but, again, this is unoptimized code. The compiler is prioritizing fast code generation and the ability to set breakpoints on individual (high-level language) statements over the generation of efficient code (which takes longer to produce and is harder to debug).
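
    As a quick aside on that calling convention: the first six integer or pointer arguments are passed in RDI, RSI, RDX, RCX, R8, and R9, in that order. A small sketch, not taken from your program, shows the register assignment (assuming GCC with optimizations enabled; the exact instructions may vary by version):

    // The first parameter arrives in EDI/RDI, the second in ESI/RSI.
    // An optimizing GCC typically compiles this to something like:
    //     lea     eax, [rdi+rsi]
    //     ret
    int add(int a, int b)
    {
        return a + b;
    }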

    The pointless spill to, and immediate reload from, the stack in get_v is another symptom of unoptimized code:

    mov    QWORD PTR [rbp-0x8],rdi
    mov    rax,QWORD PTR [rbp-0x8]
    

    None of this is required. It's all just busy work, a common calling card of unoptimized code. In an optimized build, or hand-written assembly, it would have been written simply as a register-to-register move, e.g.:

    mov    rax, rdi
    

    You'll see GCC follow the pattern you've observed in any unoptimized build. Consider this function:

    void SetParam(int& a)
    {
        a = 0x2;
    }
    

    With -O0 (optimizations disabled), GCC emits the following:

    SetParam(int&):
        push    rbp
        mov     rbp, rsp
        mov     QWORD PTR [rbp-8], rdi
        mov     rax, QWORD PTR [rbp-8]
        mov     DWORD PTR [rax], 2
        nop
        pop     rbp
        ret
    

    Look familiar?

    Now enable optimizations, and we get the more sensible:

    SetParam(int&):
        mov     DWORD PTR [rdi], 2
        ret
    

    Here, the store is done directly into the address passed in the RDI register. No stack frame needs to be set up or torn down. In fact, the stack is bypassed altogether. Not only is the code much simpler and easier to understand, it is also much faster.

    Which serves as a lesson: when you are trying to analyze a compiler's object-code output, always enable optimizations (for GCC, that means compiling with -O2 or -O3 instead of the default -O0; adding -S makes it emit assembly directly). Studying unoptimized builds is largely a waste of time, unless you are actually interested in how the compiler generates unoptimized code (e.g., because you're writing or reverse-engineering the compiler itself). Otherwise, what you care about is optimized code, because it is both simpler to understand and far more representative of real-world binaries.

    Your entire get_v function could be simply:

    mov   DWORD PTR [rdi], 0x2
    mov   eax, DWORD PTR [rdi]
    ret
    

    There's no reason to use the stack, shuffling values back and forth, and no reason to reload the pointer from [rbp-0x8], since that value is already sitting in RDI.

    But actually, we can do even better than this, since we are moving a constant into the address stored in RDI:

    mov   DWORD PTR [rdi], 0x2
    mov   eax, 0x2
    ret
    

    In fact, this is exactly what GCC generates for what I imagine is your get_v function:

    int get_v(int& a)
    {
        a = 0x2;
        return a;
    }
    

    Unoptimized:

    get_v(int&):
        push    rbp
        mov     rbp, rsp
        mov     QWORD PTR [rbp-8], rdi
        mov     rax, QWORD PTR [rbp-8]
        mov     DWORD PTR [rax], 2
        mov     rax, QWORD PTR [rbp-8]
        mov     eax, DWORD PTR [rax]
        pop     rbp
        ret
    

    Optimized:

    get_v(int&):
        mov     DWORD PTR [rdi], 2
        mov     eax, 2
        ret
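
    And, for completeness, the main in your disassembly was presumably compiled from something along these lines. This is only a guess at the original source: it assumes get_v takes an int& (or equivalently an int*), that stack-protector instrumentation accounts for the fs:0x28 loads and the conditional call near the end of main, and that the variable name is a placeholder:

    int get_v(int& a);   // reconstructed above

    int main()
    {
        int v;           // the local occupying [rbp-0xc]
        get_v(v);        // its address is what main places in RDI
        return 0;        // the mov eax,0x0 before the canary check
    }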