Tags: c++, vectorization, compiler-optimization, strict-aliasing, auto-vectorization

Why do compilers miss vectorization here?


Consider the following valarray-like class:

#include <stdlib.h>

struct va
{
    void add1(const va& other);
    void add2(const va& other);

    size_t* data;
    size_t  size;
};

void va::add1(const va& other) {
    for (size_t i = 0; i < size; ++i) {
        data[i] += other.data[i];
    }
}

void va::add2(const va& other){
    for (size_t i = 0, s = size; i < s; ++i) {
        data[i] += other.data[i];
    }
}

The add2 function is vectorized by all of the compilers tried (MSVC, Clang, GCC, ICC), whereas add1 is not. See https://godbolt.org/z/c61qvrrbv

This is explained by potential aliasing: the compilers cannot prove that one of the elements pointed to by data is not size itself.
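
For example, a contrived but perfectly legal caller (hypothetical, purely for illustration) can make data point at the size member itself, which is exactly what the compiler has to guard against:

#include <stdio.h>

// Assumes the va definition above.
int main() {
    va v;
    v.data = &v.size;   // a size_t* is allowed to point at the size_t member
    v.size = 0;         // the loop body never runs here, but the compiler can't know that
    v.add1(v);          // inside add1, this->size must be reloaded every iteration
    printf("%zu\n", v.size);
    return 0;
}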

However, there's also potential overlap between the elements pointed to by data and by other.data. For MSVC there is additionally potential aliasing between those elements and the pointers themselves, since it does not take advantage of the Strict Aliasing Rule. This applies to both add1 and add2.

The compilers insert runtime checks for all the possible aliasing they suspect, and execute a vectorized loop for add2.

Why don't they add a few more checks and get this optimization for add1 as well?


Solution

  • It looks like compilers aren't able to realize (and don't add code to check) that data[i] never points to this->size. That possible aliasing would make the loop trip-count not calculable ahead of the first iteration, which is something no mainstream compiler except ICC can handle.

    Hopefully compilers can learn to check for possible overlap before applying their vectorization logic, with something like (.data > this) || (.data + .size < this); hopefully there's an efficient way to do that. They already invent code to check for overlap between the two arrays in add2.
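
    As a hand-written illustration of that idea (hypothetical code, not something any compiler emits, and only covering the this->size hazard, not the .data pointers), the check could dispatch to a loop with the size hoisted into a local, the way add2 effectively does:

    // Hypothetical free function, assuming the va definition from the question.
    void add1_checked(va& self, const va& other) {
        const char* obj_begin = reinterpret_cast<const char*>(&self);
        const char* obj_end   = obj_begin + sizeof(va);
        const char* arr_begin = reinterpret_cast<const char*>(self.data);
        const char* arr_end   = reinterpret_cast<const char*>(self.data + self.size);

        // (Strictly, ordering pointers into unrelated objects should go through
        // std::less; a compiler generating this check internally has no such issue.)
        if (arr_end <= obj_begin || arr_begin >= obj_end) {
            for (size_t i = 0, s = self.size; i < s; ++i)   // size hoisted: vectorizable
                self.data[i] += other.data[i];
        } else {
            for (size_t i = 0; i < self.size; ++i)          // possible aliasing: reload size
                self.data[i] += other.data[i];
        }
    }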

    (The more checking code is required at startup, the more profitable vectorization has to be to pay for itself; 64-bit scalar elements aren't that bad with baseline x86-64 compared to 128-bit vectors especially when the compiler doesn't know from PGO that size is usually not small and that the loop is hot. I tried gcc -march=icelake-client and -march=znver4 to not only enable AVX2 but also set tuning heuristics for CPUs with very good vector throughput and cache/memory bandwidth. But still no joy, so this possible aliasing is probably a full roadblock, not a heuristic decision.)


    Asm evidence for compilers being worried about aliasing this->size

    Notice how GCC's loop branch is cmp rax, QWORD PTR [rdi+8], where rax holds i and [rdi+8] is this->size (rdi holds this in the x86-64 SysV calling convention), so it's reloading size every iteration. If we compile with -O3 -fno-tree-vectorize, we see that GCC's add2 loads the size into a register ahead of the loop and compares against that inside the loop, i.e. it hoists the load. The fact that GCC doesn't do that for add1 is a pretty clear sign that it thinks data[i] += ... can modify this->size.

    # GCC's add1 inner loop with -O3   -march=icelake-client
    .L3:
            mov     rcx, QWORD PTR [rsi+rax*8]
            add     QWORD PTR [rdx+rax*8], rcx
            inc     rax
            cmp     rax, QWORD PTR [rdi+8]
            jb      .L3
    

    Also, changing the member type to unsigned *data; or anything else that can't point to a size_t lets GCC, Clang, and ICC auto-vectorize add1. Adding -fno-strict-aliasing disables vectorization again. (It also makes compilers extra "paranoid", reloading this->data and other.data every iteration, like MSVC was already doing, and it breaks vectorization of add2 for those compilers.)
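
    A minimal sketch of that variant (the name vb is made up here, with the member function defined inline just to keep it short):

    // Hypothetical variant of the question's struct: a store to data[i] writes an
    // unsigned, which type-based alias analysis knows can't modify size (a size_t),
    // so the trip count is known before the loop and it can vectorize like add2.
    struct vb {
        unsigned* data;
        size_t    size;

        void add1(const vb& other) {
            for (size_t i = 0; i < size; ++i)
                data[i] += other.data[i];
        }
    };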

    Changing the pointer type doesn't help MSVC because it doesn't do type-based aliasing analysis; it always acts like gcc -fno-strict-aliasing. MSVC's add2 already does check for more than just overlap of the pointed-to arrays, I think; probably some of that extra cmp/jcc is checking that this->data[i] += ... won't change the .data pointer in this or other.


    std::vector doesn't have size_t members, usually avoiding this

    std::vector<size_t> wouldn't have this problem (at least with non-MSVC compilers), because type-based aliasing knows that a size_t* can't point at another pointer. std::vector normally stores three pointers: .begin, .end, and end_of_capacity, so the size is computed as end - begin rather than stored as a member.
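
    A rough sketch of that layout (hypothetical names, not the real libstdc++/libc++ definition):

    // Hypothetical three-pointer layout; the size is derived, never stored as a size_t member.
    template <class T>
    struct vector_ish {
        T* begin_;            // first element
        T* end_;              // one past the last element
        T* end_of_capacity_;  // one past the end of the allocation

        size_t size() const { return static_cast<size_t>(end_ - begin_); }
    };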


    For iterating over one array, it's normally at least as efficient to increment a pointer, like for (... ; ptr < endp ; ptr++) *ptr, so you're not using indexed addressing modes. That's presumably why std::vector is normally laid out that way, rather than as a pointer plus two size_t members.
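
    For example, element-wise addition over the vector_ish sketch above, iterating by pointer (again hypothetical code, not from any library):

    // The end pointer is hoisted into a local, and under strict aliasing a size_t
    // store can't modify the T* members anyway, so the trip count is fixed up front;
    // both arrays are walked with plain pointer increments, no indexed addressing modes.
    inline void add(vector_ish<size_t>& a, const vector_ish<size_t>& b) {
        const size_t* src = b.begin_;
        for (size_t *p = a.begin_, *end = a.end_; p < end; ++p, ++src)
            *p += *src;
    }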

    Some RISC machines don't even have two-register indexed addressing modes. For iterating two arrays, some CPUs will do better with fewer instructions just incrementing one index instead of two pointer increments, but it depends on the microarchitecture, e.g. some x86 CPUs un-laminate add reg, [reg + reg] into 2 uops in the back-end, not keeping it micro-fused, especially with 3-operand AVX instructions.

    An efficient way (in asm) to loop over two arrays on CPUs with indexed addressing modes is to address one array relative to the other. It's UB to do this in C++, and it would obfuscate your code, so it's something compilers should do for you: sub rsi, rdi outside the loop (subtract the pointers), then the loop body can be mov eax, [rsi + rdi] (second array = difference + first) / add [rdi], eax (first array) / add rdi, 4 (increment the pointer, which is also the index for the other array).

    MSVC actually does do this optimization, which other compilers haven't picked up yet. (Godbolt)

    // Compilers still optimize without __restrict, but it gets rid of the noise of extra checking
    void foo(int *__restrict a, int *__restrict b){
        for (int i=0 ; i<10240; i++){
            a[i] += b[i];
        }
    }
    
    void foo(int * __restrict,int * __restrict) PROC                              ; foo, COMDAT
            lea     rax, QWORD PTR [rcx+32]
            sub     rdx, rcx       ;;;; <---- Pointer subtraction
            mov     ecx, 320                      ; 00000140H
            npad    4
    $LL4@foo:
            vmovdqu ymm1, YMMWORD PTR [rax-32]            ;; unrolled with 4 vectors
            vpaddd  ymm1, ymm1, YMMWORD PTR [rdx+rax-32]
            vmovdqu YMMWORD PTR [rax-32], ymm1
    
            vmovdqu ymm2, YMMWORD PTR [rax]
            vpaddd  ymm1, ymm2, YMMWORD PTR [rdx+rax]
            vmovdqu YMMWORD PTR [rax], ymm1
    
            vmovdqu ymm1, YMMWORD PTR [rax+32]
            vpaddd  ymm1, ymm1, YMMWORD PTR [rdx+rax+32]
            vmovdqu YMMWORD PTR [rax+32], ymm1
    
            vmovdqu ymm1, YMMWORD PTR [rax+64]
            vpaddd  ymm1, ymm1, YMMWORD PTR [rdx+rax+64]
            vmovdqu YMMWORD PTR [rax+64], ymm1
    
            lea     rax, QWORD PTR [rax+128]
            sub     rcx, 1
            jne     SHORT $LL4@foo
            vzeroupper
            ret     0
    

    Unfortunately, MSVC got it backwards and is using the two-register addressing mode as the memory source operand for vpaddd. That will un-laminate at issue/rename into the ROB on Intel Sandybridge-family, including at least Skylake and probably some later cores. But vpaddd ymm1, ymm1, [rax-32] with a one-register addressing mode would stay a single micro-fused uop. The pure load vmovdqu is always 1 uop even with an indexed addressing mode.

    Indexed stores aren't ideal either (the store-address uop can't run on port 7 on Haswell / Skylake), and MSVC does avoid that. But it could get the best of both worlds by doing the pure load from b[] with an indexed addressing mode, and then using a memory-source vpaddd and a store with simple addressing modes like [rax+32].

    So MSVC did save some code-size, and it gets some benefit in back-end throughput by needing only one pointer increment of loop overhead, and in AGU port contention: this can run at close to 1 vector per clock with L1d cache hits, which requires doing 2 loads + 1 store every cycle. (Intel's optimization manual suggests that Skylake can't fully sustain that for 32-byte loads/stores, for some unknown reason.)

    With an indexed addressing mode for the store, as GCC and Clang use, it could only run at 1 vector per 1.5 cycles on Haswell / Skylake. (Ice Lake has two load AGUs and two separate store AGUs, avoiding this bottleneck, but I don't know whether un-lamination of indexed addressing modes is still a thing on Ice Lake or Alder Lake.)

    I don't know what's up with MSVC preferring lea for incrementing a pointer, or preferring sub rcx, 1 / jne instead of computing an end-pointer before the loop (with lea instead of mov) and comparing against it. Then the end of the loop could be cmp rax, r8 / jne or something, which would macro-fuse into a single uop on AMD as well as Intel.