Calling Convention Discrepancy in x86_64 Assembly

I have two assembly routines: ASM_Method(void*, void*, int), which takes three parameters, and init_method(float, int*). The parameters of interest are the two void pointers of the former.

When I call the routines from the C++ file like this:

float src[64];
float dest[64];
int radius[3];

init_method(1.5, radius);
ASM_Method(src, dest, 64);

Disassembly of this calling process:

mov         r8d,100h  
lea         rdx,[rbp+0A0h]  
lea         rcx,[rbp-60h]  
call        ASM_Method

Initialized or not, the program works fine. HOWEVER, when I do:

float* src = new float[64];
float* dest = new float[64];
int radius[3];

init_method(1.5, radius);
ASM_Method(src, dest, 64);

When called, RCX is set to a value that is NOT the correct address, but RDX is correct. The program crashes as a result.

Disassembly of this calling process:

mov         r8d,100h  
mov         rdx,rbx  
mov         rcx,rdi  
call        ASM_Method

Unless I initialize src to some values, RCX changes to an invalid address (in this case, 1) when called.

Assembly code for ASM_Method:

mov rax, rdx
add rax, r8
shr r8, 4
inc r8
xor r9, r9
movdqu xmm1, [rax]

MainLoop:
movdqu xmm0, [rcx + r9]
movdqu [rdx + r9], xmm0
add r9, 16
dec r8
jnz MainLoop

movdqu [rax], xmm1

ret

Assembly code for init_method:

mulss xmm0, xmm0
mov ecx, 4
cvtsi2ss xmm1, ecx
mulss xmm0, xmm1

shr ecx, 2
cvtsi2ss xmm2, ecx
addss xmm2, xmm0
sqrtss xmm2, xmm2

stmxcsr roundFlags
or roundFlags, 2000h
ldmxcsr roundFlags

cvtss2si ecx, xmm2

stmxcsr roundFlags
and roundFlags, 0DFFFh
ldmxcsr roundFlags

mov eax, ecx
dec eax
bt ecx, 0
cmovnc ecx, eax

mov eax, 3
cvtsi2ss xmm1, eax
mulss xmm0, xmm1

cvtsi2ss xmm3, ecx
movss xmm2, xmm3
movss xmm4, xmm3

mulss xmm2, xmm2
mulss xmm2, xmm1

mov eax, 12
cvtsi2ss xmm1, eax
mulss xmm3, xmm1

mov eax, -4
cvtsi2ss xmm1, eax
mulss xmm4, xmm1
addss xmm4, xmm1

mov eax, 9
cvtsi2ss xmm1, eax

subss xmm0, xmm2
addss xmm3, xmm1
subss xmm0, xmm3
divss xmm0, xmm4

cvtss2si eax, xmm0

mov esi, ecx
add esi, 2

mov edi, ecx
cmp eax, 0
cmovle edi, esi
shr edi, 1
mov dword ptr [edx], edi

mov edi, ecx
cmp eax, 1
cmovle edi, esi
shr edi, 1
mov dword ptr [edx + 4], edi

mov edi, ecx
cmp eax, 2
cmovle edi, esi
shr edi, 1
mov dword ptr [edx + 8], edi

ret

What is going on?

Solution

I'd [still!] like to see the full disassembly of case 2, but I'll take a guess.

(1) The compiler loads rdi with the correct value: the address of src [the result of new, which probably ends up in malloc].

In the MS x64 ABI, rdi is considered "non-volatile". It must be preserved by a callee.

(2) Case 2 then calls init_method. But init_method does not preserve rdi [as it must]. It uses edi as a scratch register, and a write to edi also zeroes the upper half of rdi. So, upon return, rdi has been trashed! [The same applies to esi, which is also non-volatile.]

(3) When init_method returns, the compiler expects rdi to still hold the value it had after step (1). The compiler has no knowledge that init_method corrupted rdi, so it copies rdi into rcx [the first argument to ASM_Method]. This should be the src value, but it's actually whatever init_method left there (i.e. a junk value, relatively speaking). The last thing init_method does to edi is shr edi, 1, which leaves a small integer and is consistent with the 1 you see in RCX. RDX stays correct because dest is kept in rbx, which init_method never touches. Case 1 is unaffected because both addresses are recomputed from rbp with lea after init_method returns.
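
To make steps (1) to (3) concrete, here's a toy C++ model of the situation. It's only a sketch: Regs, init_method_clobbering, init_method_preserving and rcx_seen_by_asm_method are made-up names, the "registers" are plain struct fields, and the scratch values 5 and 1 are stand-ins for whatever the real init_method leaves behind.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a few x86-64 registers. This is NOT real register access;
// it only mimics who may overwrite what under the Microsoft x64 ABI.
struct Regs {
    std::uint64_t rcx = 0, rdx = 0, r8 = 0;   // volatile: argument registers
    std::uint64_t rbx = 0, rsi = 0, rdi = 0;  // non-volatile: callee must preserve
};

// Mimics the init_method from the question: rsi/rdi are used as scratch
// and never restored.
inline void init_method_clobbering(Regs& r) {
    r.rsi = 5;  // stand-in for "mov esi, ecx / add esi, 2"
    r.rdi = 1;  // stand-in for the final "shr edi, 1"
}

// Hypothetical fixed version: save on entry, restore on exit
// (the real routine would use push/pop for this).
inline void init_method_preserving(Regs& r) {
    const std::uint64_t saved_rsi = r.rsi;
    const std::uint64_t saved_rdi = r.rdi;
    r.rsi = 5;  // same scratch usage as above
    r.rdi = 1;
    r.rsi = saved_rsi;
    r.rdi = saved_rdi;
}

// Mimics the optimized caller of case 2: src lives in rdi, dest in rbx,
// init_method is called, then the argument registers are loaded.
// Returns the value that ends up in rcx (first argument of ASM_Method).
template <typename InitFn>
std::uint64_t rcx_seen_by_asm_method(std::uint64_t src, std::uint64_t dest,
                                     InitFn init) {
    Regs r;
    r.rdi = src;    // mov rdi, rax  (after the first allocation)
    r.rbx = dest;   // mov rbx, rax  (after the second allocation)
    init(r);        // call init_method
    r.r8 = 0x100;   // mov r8d, 100h
    r.rdx = r.rbx;  // mov rdx, rbx
    r.rcx = r.rdi;  // mov rcx, rdi
    return r.rcx;
}
```

With init_method_clobbering, the caller ends up passing 1 in rcx; with init_method_preserving, it passes the original src address. In the real routine, the equivalent fix is to push rsi and rdi on entry and pop them before ret [or to use only volatile registers such as r9, r10 and r11 as scratch].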

UPDATE:

The ABI differs between platforms [in practice, the compiler follows the platform]. On Linux, macOS and other Unix-like systems, gcc and clang use the System V AMD64 calling convention, which differs from the MS one (i.e. MS is the odd duck or usual suspect). For example, under System V, rdi holds the first integer argument and is volatile.

Here's the wiki link that should highlight most of the ABIs: https://en.wikipedia.org/wiki/X86_calling_conventions
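
As a quick reference for the difference, here's a small C++ lookup of the callee-saved general-purpose registers in the two conventions. The names Abi and is_callee_saved are made up for this sketch, and it deliberately ignores the xmm registers [MS also treats xmm6 through xmm15 as non-volatile].

```cpp
#include <cassert>
#include <cstring>

enum class Abi { MicrosoftX64, SystemV };

// Returns true if the named general-purpose register must be preserved
// by the callee under the given ABI. Register names are lowercase.
inline bool is_callee_saved(Abi abi, const char* reg) {
    // Callee-saved in both conventions.
    static const char* const common[] = {"rbx", "rbp", "rsp",
                                         "r12", "r13", "r14", "r15"};
    // Callee-saved only in the Microsoft x64 convention.
    static const char* const ms_only[] = {"rsi", "rdi"};

    for (const char* name : common) {
        if (std::strcmp(name, reg) == 0) return true;
    }
    if (abi == Abi::MicrosoftX64) {
        for (const char* name : ms_only) {
            if (std::strcmp(name, reg) == 0) return true;
        }
    }
    return false;  // everything else (rax, rcx, rdx, r8..r11, ...) is volatile
}
```

So is_callee_saved(Abi::MicrosoftX64, "rdi") is true, while the same query for Abi::SystemV is false. That's why assembly written against one convention can silently break callers compiled for the other.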

UPDATE #2:

But why does one refer to the stack (i.e. float src[64]) yet the other refers to registers (new float[64]) before calling?

Because of compiler optimization. To explain, we'll "turn off" optimization for a bit.

All function-scoped variables have a "reserved slot" in the function's stack frame. All these "slots" have a fixed offset within the stack frame that is computed by [and so known to] the compiler. If the function has a stack frame at all [some leaf functions can elide it], then all variables have their slots, regardless of whether optimization is being used. Hold that thought ...

When you have a fixed-size array as in case 1, the entire space (i.e. data) for that array is within the frame. So, the address of the given array is the frame pointer + the array's offset. Hence the lea rcx,[rbp + offset_of_src].

Scalar variables have slots, too. That includes things like "pointers to arrays", which is what we have in case 2.
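
The array-versus-pointer distinction can be checked from C++ itself. This is just a sketch: StorageInfo and compare_storage are made-up names, and std::unique_ptr is only there so the sketch doesn't leak.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

struct StorageInfo {
    std::size_t array_bytes;        // bytes the array variable occupies
    std::size_t pointer_bytes;      // bytes the pointer variable occupies
    bool array_object_is_its_data;  // does &array point at the first element?
    bool pointer_slot_is_its_data;  // does &pointer point at the first element?
};

inline StorageInfo compare_storage() {
    float src_array[64];                           // case 1: data lives in the frame
    std::unique_ptr<float[]> heap(new float[64]);  // case 2: data lives on the heap
    float* src_pointer = heap.get();               // only this pointer is a local

    StorageInfo info;
    info.array_bytes = sizeof(src_array);      // the whole array
    info.pointer_bytes = sizeof(src_pointer);  // just one pointer
    // The array variable and its data are the same storage ...
    info.array_object_is_its_data =
        static_cast<const void*>(&src_array) ==
        static_cast<const void*>(&src_array[0]);
    // ... but the pointer variable only holds the address of its data.
    info.pointer_slot_is_its_data =
        static_cast<const void*>(&src_pointer) ==
        static_cast<const void*>(src_pointer);
    return info;
}
```

For the array, the compiler only needs the frame address plus an offset [hence the lea]. For the pointer, it needs the value stored in the variable, and that value can just as well sit in a register.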

[Remember, optimization is off for the moment] Part of the missing code in case 2 was something like [simplified]:

; allocate src
call malloc
mov [rbp + offset_of_src],rax

; allocate dest
call malloc
mov [rbp + offset_of_dest],rax

; set up arguments for init_method and call it
call init_method

; call ASM_Method
mov r8d,64
mov rdx,[rbp + offset_of_dest]
mov rcx,[rbp + offset_of_src]
call ASM_Method

Notice that, here, we don't want to pass the address of the pointer variable. We want to pass the contents of the pointer variable. That's why these are mov instructions rather than the lea of case 1.

Now, let's turn the optimizer back on. Just because a function variable has a slot on the stack frame doesn't mean that the generated code is obligated to use it. For a simple function as in case 2, the optimizer realizes that it can use non-volatile registers to store the src and dest values and can eliminate stack access/storage for them.

So, with optimization, case 2 looks like:

; allocate src
call malloc
mov rdi,rax

; allocate dest
call malloc
mov rbx,rax

; set up arguments for init_method and call it
call init_method

; call ASM_Method
mov r8d,64
mov rdx,rbx
mov rcx,rdi
call ASM_Method

The particular non-volatiles selected by the compiler are arbitrary. In this instance, they just happened to be rdi [for src] and rbx [for dest], as the disassembly in the question shows, but there are others to choose from.

The compiler/optimizer is quite clever about selecting these registers and others to hold data values. It can see when a given function no longer needs the value in the register and can reassign it to hold another [unrelated] value if it chooses.

Okay, remember the "hold that thought"? Time to exhale. Normally, once a variable is given a register assignment, the compiler tries to leave it alone until it's no longer needed. But, sometimes, there aren't enough registers to hold all active variables at one time.

For example, if a function has [say] four nested for loops and uses 20 different variables, there aren't enough registers to go around. So, the compiler may have to generate code that "dumps" a value in a register back to the stack frame slot for the corresponding variable. This is a "register spill".

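That's why it helps to think of a scalar as always having a slot in the stack frame, even if the slot is never used [due to optimizing the value to a register]. It keeps the compilation process simpler and the offsets the same. [A real optimizer may reserve the slot only when it actually needs to spill.]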

So far, we've been talking about callee-saved registers: most functions push the non-volatiles they use upon entry and pop them at exit (i.e. they preserve the non-volatiles for their caller). But what about caller-saved registers?

A given function (e.g. A) may use a volatile register (e.g. r10) to hold a variable (e.g. sludge). If it calls another function (e.g. B), B might trash A's value.

So, if A wishes to preserve a value in r10 across a call to B, A must save it, call B, and then restore it:

mov [rbp + offset_of_sludge],r10
call B
mov r10,[rbp + offset_of_sludge]

So, it's handy to have a stack frame slot available.

Sometimes, the function has so many variables that the code generated for some of them looks like the non-optimized version:

mov rax,[rbp + offset_of_foo]
add rax,rdx
sub rax,rdi
mov [rbp + offset_of_foo],rax

because foo is accessed too infrequently to merit a non-volatile register assignment.