c++compiler-optimization simd memory-alignment auto-vectorization

Why does p1007r0 std::assume_aligned remove the need for epilogue?

My understanding is that vectorization of code works something like this:

For data in array bellow the first address in the array that is the multiple of 128(or 256 or whatever SIMD instructions require) do slow element by element processing. Let's call this prologue.

For data in array between the first address that is multiple of 128 and last one address that is multiple of 128 use SIMD instruction.

For the data between last address that is multiple of 128 and end of array use slow element by element processing. Let's call this epilogue.

Now I understand why std::assume_aligned helps with prologue, but I do not get why it enables compiler to remove epilogue also.

Quote from proposal:

If we could make this property visible to the compiler, it could skip the loop prologue and epilogue

Solution

You can see the effect on code-gen from using GNU C / C++ __builtin_assume_aligned.

gcc 7 and earlier targeting x86 (and ICC18) prefer to use a scalar prologue to reach an alignment boundary, then an aligned vector loop, then a scalar epilogue to clean up any leftover elements that weren't a multiple of a full vector.

Consider the case where the total number of elements is known at compile time to be a multiple of the vector width, but the alignment isn't known. If you knew the alignment, you don't need either a prologue or epilogue. But if not, you need both. The number of left-over elements after the last aligned vector is not known.

This Godbolt compiler explorer link shows these functions compiled for x86-64 with ICC18, gcc7.3, and clang6.0. clang unrolls very aggressively, but still uses unaligned stores. This seems like a weird way to spend that much code-size for a loop that just stores.

// aligned, and size a multiple of vector width
void set42_aligned(int *p) {
    p = (int*)__builtin_assume_aligned(p, 64);
    for (int i=0 ; i<1024 ; i++ ) {
        *p++ = 0x42;
    }
}

 # gcc7.3 -O3   (arch=tune=generic for x86-64 System V: p in RDI)

    lea     rax, [rdi+4096]              # end pointer
    movdqa  xmm0, XMMWORD PTR .LC0[rip]  # set1_epi32(0x42)
.L2:                                     # do {
    add     rdi, 16
    movaps  XMMWORD PTR [rdi-16], xmm0
    cmp     rax, rdi
    jne     .L2                          # }while(p != endp);
    rep ret

This is pretty much exactly what I'd do by hand, except maybe unrolling by 2 so OoO exec could discover the loop exit branch being not-taken while still chewing on the stores.

Thus unaligned version includes a prologue and epilogue:

// without any alignment guarantee
void set42(int *p) {
    for (int i=0 ; i<1024 ; i++ ) {
        *p++ = 0x42;
    }
}

~26 instructions of setup, vs. 2 from the aligned version

.L8:            # then a bloated loop with 4 uops instead of 3
    add     eax, 1
    add     rdx, 16
    movaps  XMMWORD PTR [rdx-16], xmm0
    cmp     ecx, eax
    ja      .L8               # end of main vector loop

 # epilogue:
    mov     eax, esi    # then destroy the counter we spent an extra uop on inside the loop.  /facepalm
    and     eax, -4
    mov     edx, eax
    sub     r8d, eax
    cmp     esi, eax
    lea     rdx, [r9+rdx*4]   # recalc a pointer to the last element, maybe to avoid a data dependency on the pointer from the loop.
    je      .L5
    cmp     r8d, 1
    mov     DWORD PTR [rdx], 66      # fully-unrolled final up-to-3 stores
    je      .L5
    cmp     r8d, 2
    mov     DWORD PTR [rdx+4], 66
    je      .L5
    mov     DWORD PTR [rdx+8], 66
.L5:
    rep ret

Even for a more complex loop which would benefit from a little bit of unrolling, gcc leaves the main vectorized loop not unrolled at all, but spends boatloads of code-size on fully-unrolled scalar prologue/epilogue. It's really bad for AVX2 256-bit vectorization with uint16_t elements or something. (up to 15 elements in the prologue/epilogue, rather than 3). This is not a smart tradeoff, so it helps gcc7 and earlier significantly to tell it when pointers are aligned. (The execution speed doesn't change much, but it makes a big difference for reducing code-bloat.)

BTW, gcc8 favours using unaligned loads/stores, on the assumption that data often is aligned. Modern hardware has cheap unaligned 16 and 32-byte loads/stores, so letting the hardware handle the cost of loads/stores that are split across a cache-line boundary is often good. (AVX512 64-byte stores are often worth aligning, because any misalignment means a cache-line split on every access, not every other or every 4th.)

Another factor is that earlier gcc's fully-unrolled scalar prologues/epilogues are crap compared to smart handling where you do one unaligned potentially-overlapping vector at the start/end. (See the epilogue in this hand-written version of set42). If gcc knew how to do that, it would be worth aligning more often.