Can anyone explain the behaviour of clang?
Array4Complex64 f1(Array4Complex64 a, Array4Complex64 b){
return a * b;
}
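For context, since the question doesn't show the definition of Array4Complex64, here is a plausible minimal sketch (hypothetical name reuse, hypothetical layout): four complex doubles in struct-of-arrays form, all real parts together and all imaginary parts together, which is the layout that lets clang compute all four products with whole-register vmulpd/vfmsub/vfmadd operations on each 32-byte half:

```cpp
#include <cassert>

// Hypothetical sketch of Array4Complex64: four complex doubles in
// struct-of-arrays layout (reals together, imaginaries together), so the
// compiler can vectorize the products across whole 32-byte vectors.
struct alignas(32) Array4Complex64 {
    double re[4];
    double im[4];
};

// Element-wise complex product: (a.re + i*a.im) * (b.re + i*b.im)
Array4Complex64 operator*(const Array4Complex64& a, const Array4Complex64& b) {
    Array4Complex64 r;
    for (int k = 0; k < 4; ++k) {
        r.re[k] = a.re[k] * b.re[k] - a.im[k] * b.im[k];  // the vfmsub pattern
        r.im[k] = a.re[k] * b.im[k] + a.im[k] * b.re[k];  // the vfmadd pattern
    }
    return r;
}
```

With a definition along these lines, each half of the struct maps onto one YMM vector in the asm below.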
f1 calculates the element-wise complex product of a and b. I compiled it twice, once with alignment restrictions on the type Array4Complex64 and once without. The results are the following:
with alignment:
f1(Array4Complex64, Array4Complex64): # @f1(Array4Complex64, Array4Complex64)
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 32
mov rax, rdi
vmovapd ymm0, ymmword ptr [rbp + 80]
vmovapd ymm1, ymmword ptr [rbp + 112]
vmovapd ymm2, ymmword ptr [rbp + 16]
vmovapd ymm3, ymmword ptr [rbp + 48]
vmulpd ymm4, ymm1, ymm3
vfmsub231pd ymm4, ymm0, ymm2 # ymm4 = (ymm0 * ymm2) - ymm4
vmovapd ymmword ptr [rdi], ymm4
vmulpd ymm0, ymm3, ymm0
vfmadd231pd ymm0, ymm1, ymm2 # ymm0 = (ymm1 * ymm2) + ymm0
vmovapd ymmword ptr [rdi + 32], ymm0
mov rsp, rbp
pop rbp
vzeroupper
ret
without:
f1(Array4Complex64, Array4Complex64): # @f1(Array4Complex64, Array4Complex64)
mov rax, rdi
vmovupd ymm0, ymmword ptr [rsp + 72]
vmovupd ymm1, ymmword ptr [rsp + 104]
vmovupd ymm2, ymmword ptr [rsp + 8]
vmovupd ymm3, ymmword ptr [rsp + 40]
vmulpd ymm4, ymm1, ymm3
vfmsub231pd ymm4, ymm0, ymm2 # ymm4 = (ymm0 * ymm2) - ymm4
vmovupd ymmword ptr [rdi], ymm4
vmulpd ymm0, ymm3, ymm0
vfmadd231pd ymm0, ymm1, ymm2 # ymm0 = (ymm1 * ymm2) + ymm0
vmovupd ymmword ptr [rdi + 32], ymm0
vzeroupper
ret
The result is the same, but the addresses are calculated differently: once relative to rbp and once relative to rsp. This is not specific to multiplication; it happens with any calculation. Is one version better than the other?
The first way is uselessly aligning RSP (and setting up RBP as a frame pointer in the process, so naturally it uses it). This is obviously a missed optimization, since it's not actually spilling any of those function args to the stack (and this is not a debug build). You could report it to http://bugs.llvm.org/; include an MCVE of the source code and this asm output.
Both ways are unfortunately passing by value in stack memory, not in YMM registers. :( x86-64 System V can pass a struct in a YMM register if it's all FP and 32 bytes in size. (You might have to manually break your 64-byte args into two separate 32-byte args to pass them efficiently, if you can't get this to inline and can't use AVX-512 to pass each one in a single ZMM register.)
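One way to sketch that split (hypothetical names; verify the generated asm for your exact type) is with the GCC/clang vector extension, whose 32-byte all-double vector type gets the same x86-64 System V classification as __m256d, so with -mavx each argument travels in a single YMM register instead of on the stack:

```cpp
#include <cassert>

// GCC/clang vector extension: a 32-byte, all-double vector type. The
// x86-64 System V ABI classifies it like __m256d (SSE + SSEUP eightbytes),
// so with -mavx each such argument is passed in one YMM register.
typedef double v4df __attribute__((vector_size(32)));

// Hypothetical split of the 64-byte object into its two 32-byte halves;
// lane k holds the real (or imaginary) part of the k-th complex number.
v4df cmul_re(v4df ar, v4df ai, v4df br, v4df bi) { return ar * br - ai * bi; }
v4df cmul_im(v4df ar, v4df ai, v4df br, v4df bi) { return ar * bi + ai * br; }
```

Note that a plain struct wrapping two such halves is 64 bytes and still returned through memory; only with AVX-512 could the whole thing move in one ZMM register.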
The caller is responsible for aligning RSP when passing an aligned object by value on the stack; whatever the callee does with RSP doesn't affect the address where the data already sits, and the callee isn't copying it.
So clearly this makes this specific function less efficient, but the function is so tiny that you should definitely make sure it inlines into every call site, at which point all of this overhead disappears. (Or at least the cost of unnecessarily aligning RSP is paid once for a larger function, not once per call to a tiny function.)
The caller has to run at least 4 vmovapd store instructions, and more than that around the call site, especially because x86-64 System V has no call-preserved YMM (or even XMM) registers. So any other FP / vector variables or temporaries that were live in registers have to be spilled around a call to this function. And the call overhead itself costs the front-end some time, plus some static code size.
The amount of code at every call site of a non-inline version of this is probably similar to the amount of code it would take to just inline it, which makes inlining a pure win.
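If clang won't inline it on its own (e.g. across translation units without LTO), you can insist, sketched here with the GCC/clang-specific always_inline attribute and a stand-in two-element type rather than the real Array4Complex64:

```cpp
#include <cassert>

// Stand-in type for illustration; not the question's Array4Complex64.
struct Cplx { double re, im; };

// always_inline makes the body disappear into each caller, so the
// by-value arguments never need a stack copy or an aligned RSP at all.
__attribute__((always_inline)) inline Cplx cmul(Cplx a, Cplx b) {
    return { a.re * b.re - a.im * b.im,    // real part
             a.re * b.im + a.im * b.re };  // imaginary part
}
```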