Can anyone explain the behaviour of clang?
Array4Complex64 f1(Array4Complex64 a, Array4Complex64 b){
return a * b;
}
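For context, since the question doesn't show the definition of Array4Complex64, here is a plausible minimal sketch (hypothetical name reuse, hypothetical layout): four complex doubles in struct-of-arrays form, all real parts together and all imaginary parts together, which is the layout that lets clang compute all four products with whole-register vmulpd/vfmsub/vfmadd operations on each 32-byte half:

```cpp
#include <cassert>

// Hypothetical sketch of Array4Complex64: four complex doubles in
// struct-of-arrays layout (reals together, imaginaries together), so the
// compiler can vectorize the products across whole 32-byte vectors.
struct alignas(32) Array4Complex64 {
    double re[4];
    double im[4];
};

// Element-wise complex product: (a.re + i*a.im) * (b.re + i*b.im)
Array4Complex64 operator*(const Array4Complex64& a, const Array4Complex64& b) {
    Array4Complex64 r;
    for (int k = 0; k < 4; ++k) {
        r.re[k] = a.re[k] * b.re[k] - a.im[k] * b.im[k];  // the vfmsub pattern
        r.im[k] = a.re[k] * b.im[k] + a.im[k] * b.re[k];  // the vfmadd pattern
    }
    return r;
}
```

With a definition along these lines, each half of the struct maps onto one YMM vector in the asm below.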
f1 calculates the element-wise complex product of a and b. I compiled it twice, once with alignment restrictions on the type Array4Complex64 and once without. The results are the following:
with alignment:
f1(Array4Complex64, Array4Complex64): # @f1(Array4Complex64, Array4Complex64)
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 32
mov rax, rdi
vmovapd ymm0, ymmword ptr [rbp + 80]
vmovapd ymm1, ymmword ptr [rbp + 112]
vmovapd ymm2, ymmword ptr [rbp + 16]
vmovapd ymm3, ymmword ptr [rbp + 48]
vmulpd ymm4, ymm1, ymm3
vfmsub231pd ymm4, ymm0, ymm2 # ymm4 = (ymm0 * ymm2) - ymm4
vmovapd ymmword ptr [rdi], ymm4
vmulpd ymm0, ymm3, ymm0
vfmadd231pd ymm0, ymm1, ymm2 # ymm0 = (ymm1 * ymm2) + ymm0
vmovapd ymmword ptr [rdi + 32], ymm0
mov rsp, rbp
pop rbp
vzeroupper
ret
without:
f1(Array4Complex64, Array4Complex64): # @f1(Array4Complex64, Array4Complex64)
mov rax, rdi
vmovupd ymm0, ymmword ptr [rsp + 72]
vmovupd ymm1, ymmword ptr [rsp + 104]
vmovupd ymm2, ymmword ptr [rsp + 8]
vmovupd ymm3, ymmword ptr [rsp + 40]
vmulpd ymm4, ymm1, ymm3
vfmsub231pd ymm4, ymm0, ymm2 # ymm4 = (ymm0 * ymm2) - ymm4
vmovupd ymmword ptr [rdi], ymm4
vmulpd ymm0, ymm3, ymm0
vfmadd231pd ymm0, ymm1, ymm2 # ymm0 = (ymm1 * ymm2) + ymm0
vmovupd ymmword ptr [rdi + 32], ymm0
vzeroupper
ret
The result is the same, but the addresses are calculated differently: once relative to rbp and once relative to rsp. This is not specific to multiplication; it happens with any calculation. Is one version better than the other?
The first way is uselessly aligning RSP (and setting up RBP as a frame pointer in the process, so naturally it uses it). This is obviously a missed optimization, since it's not actually spilling any of those function args to the stack (and this is not a debug build). You could report it to http://bugs.llvm.org/; include an MCVE of the source code and this asm output.
Both ways are unfortunately passing by value in stack memory, not in YMM registers. :( x86-64 System V can pass a struct in a YMM register if it's all FP and 32 bytes in size. (You might have to manually break your 64-byte args into two separate 32-byte args to pass them efficiently, if you can't get this to inline and can't use AVX-512 to pass each one in a single ZMM register.)
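One way to sketch that split (hypothetical names; verify the generated asm for your exact type) is with the GCC/clang vector extension, whose 32-byte all-double vector type gets the same x86-64 System V classification as __m256d, so with -mavx each argument travels in a single YMM register instead of on the stack:

```cpp
#include <cassert>

// GCC/clang vector extension: a 32-byte, all-double vector type. The
// x86-64 System V ABI classifies it like __m256d (SSE + SSEUP eightbytes),
// so with -mavx each such argument is passed in one YMM register.
typedef double v4df __attribute__((vector_size(32)));

// Hypothetical split of the 64-byte object into its two 32-byte halves;
// lane k holds the real (or imaginary) part of the k-th complex number.
v4df cmul_re(v4df ar, v4df ai, v4df br, v4df bi) { return ar * br - ai * bi; }
v4df cmul_im(v4df ar, v4df ai, v4df br, v4df bi) { return ar * bi + ai * br; }
```

Note that a plain struct wrapping two such halves is 64 bytes and still returned through memory; only with AVX-512 could the whole thing move in one ZMM register.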
The caller is responsible for aligning RSP when passing an aligned object by value on the stack; whatever the callee does with RSP doesn't affect the address where the data already sits, and the callee isn't copying it.
So clearly this makes this specific function less efficient, but the function is so tiny that you should definitely make sure it inlines into every call site, at which point all of this overhead disappears. (Or at least the cost of unnecessarily aligning RSP is paid once for a larger function, not once per call to a tiny function.)
The caller has to run at least 4 vmovapd store instructions, and more than that around the call site, especially because x86-64 System V has no call-preserved YMM (or even XMM) registers. So any other FP / vector variables or temporaries that were live in registers have to be spilled around a call to this function. And the call overhead itself costs the front-end some time, plus some static code size.
The amount of code at every call site of a non-inline version of this is probably similar to the amount of code it would take to just inline it, which makes inlining a pure win.
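If clang won't inline it on its own (e.g. across translation units without LTO), you can insist, sketched here with the GCC/clang-specific always_inline attribute and a stand-in two-element type rather than the real Array4Complex64:

```cpp
#include <cassert>

// Stand-in type for illustration; not the question's Array4Complex64.
struct Cplx { double re, im; };

// always_inline makes the body disappear into each caller, so the
// by-value arguments never need a stack copy or an aligned RSP at all.
__attribute__((always_inline)) inline Cplx cmul(Cplx a, Cplx b) {
    return { a.re * b.re - a.im * b.im,    // real part
             a.re * b.im + a.im * b.re };  // imaginary part
}
```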