c++, gcc, clang, webassembly, simd

What is the proper method to load GNU C generic vectors?


GCC/Clang-provided vector extensions are a convenient way to enable SIMD vectorisation on multiple architectures (like WebAssembly, arm64, x64).

With

using v8x16u = uint8_t __attribute__((vector_size(16)));
v8x16u add(v8x16u a, v8x16u b) { return a + b; }
        // generated arm64 assembly:
        add     v0.16b, v1.16b, v0.16b
        ret

it is easy to do some SIMD programming (even if the result is not highly performant in some cases).

What I have not been able to find in any documentation is the preferred/canonical way to initialise these vectors from memory.

v8x16u load(uint8_t const *p) {
   v8x16u a;
   for (int i = 0; i < 16; i++) a[i] = p[i];
   return a;
}

works as expected on Clang x64, Clang armv8 and GCC x64, and a little suboptimally with GCC arm64 7.3 (a redundant stack adjustment):

   ldr q0, [x0]    // clang armv8
   movdqu  xmm0, XMMWORD PTR [rdi]  // gcc x64 7.3
   movups  xmm0, xmmword ptr [rdi]  // clang 15.0.0 x64

   sub     sp, sp, #16 // GCC arm64 7.3
   ldr     q0, [x0]
   add     sp, sp, 16

In WebAssembly the result is a disaster (a loop). Unrolling the load works for WebAssembly and Clang, but not for GCC, which emits 16 individual byte loads:

v8x16u load(uint8_t const *p) {
   v8x16u a{p[0],p[1],p[2],p[3],
            p[4],p[5],p[6],p[7],
            p[8],p[9],p[10],p[11],
            p[12],p[13],p[14],p[15]
   };
   return a;
}
        # wasm output for the unrolled version:
        local.get       0
        v128.load       0:p2align=0
        end_function

Finally, type punning does compile, but wouldn't it introduce UB? Also, it seems that Clang might have a bug in the implementation (since using vs typedef should, AFAIK, behave identically in this case, preserving the attributes):

v8x16u load_fail(uint8_t const *p) {
    using Vec = uint8_t __attribute__((vector_size(16), aligned(1)));
    return *reinterpret_cast<const Vec*>(p);
}
        movaps  xmm0, xmmword ptr [rdi]   // aligned(1) is lost: movaps requires 16-byte alignment

v8x16u load_okish(uint8_t const *p) {
    typedef uint8_t Vec __attribute__((vector_size(16)))  __attribute__((aligned(1)));
    return *reinterpret_cast<const Vec*>(p);
}
        movups  xmm0, xmmword ptr [rdi]   // unaligned load, as requested

Solution

  • Cast a pointer to the vector type with the appropriate aligned attribute, and dereference it (see the load/store sketch at the end of this answer). It does not introduce UB when the type of the pointed-to memory (its dynamic type, which may differ from the declared type, e.g. after placement new, or if it is allocated on the heap in the first place) is compatible with the underlying scalar type of the vector. GCC allows aliasing between scalar and vector types, but does not document it yet. For Clang/LLVM it appears to be the same (aliasing is allowed, but not documented). For both compilers the guarantee naturally falls out of how autovectorization internally introduces vector-typed accesses to arrays of scalars.

    Due to the Clang bug noted in the comments, the variant with reduced alignment needs to be introduced with typedef rather than using.

    You may additionally introduce vector variants with the may_alias attribute for use when accessing memory with unknown/arbitrary type (when the scalar implementation would use memcpy or char-based accesses), or use a vector of char to perform the memory access and then bit-cast the vector to the type needed for computations (a sketch of this is also shown below).

    Using memcpy is risky because system headers commonly override it with an inline variant using __builtin_object_size for hardening (the so-called FORTIFY_SOURCE feature), which interferes with optimization, particularly speed/size estimates for inlining. It's possible to avoid that using __builtin_memcpy explicitly, but at that point using custom vector types seems cleaner.
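
    For illustration, here is a minimal sketch of a load/store pair built that way. The helper names are mine, and I use typedef with the attribute placement from the load_okish example above because of the Clang bug:

    #include <cstdint>

    using v8x16u = uint8_t __attribute__((vector_size(16)));

    v8x16u load(uint8_t const *p) {
        // typedef (not using) so that Clang keeps the reduced alignment
        typedef uint8_t VecU __attribute__((vector_size(16))) __attribute__((aligned(1)));
        return *reinterpret_cast<const VecU*>(p);   // unaligned vector load
    }

    void store(uint8_t *p, v8x16u v) {
        typedef uint8_t VecU __attribute__((vector_size(16))) __attribute__((aligned(1)));
        *reinterpret_cast<VecU*>(p) = v;            // unaligned vector store
    }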
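    Similarly, a sketch of the may_alias variant for memory of unknown dynamic type (the type and function names are mine): the access goes through a byte vector, which is then cast to the vector type used for computation.

    #include <cstdint>

    // may_alias: this vector type may alias anything, like char does
    typedef uint8_t v8x16u_any __attribute__((vector_size(16))) __attribute__((aligned(1), may_alias));
    typedef uint32_t v4x32u __attribute__((vector_size(16)));

    v4x32u load_words(void const *p) {
        v8x16u_any bytes = *reinterpret_cast<const v8x16u_any*>(p);
        // GNU C allows casting between vector types of the same total size (a bit reinterpretation)
        return (v4x32u)bytes;
    }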
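    Finally, the __builtin_memcpy route mentioned above might look like this (again just a sketch, the function name is mine; v8x16u is the alias from the question):

    v8x16u load_memcpy(uint8_t const *p) {
        v8x16u a;
        // __builtin_memcpy bypasses any fortified inline memcpy wrapper from system headers
        __builtin_memcpy(&a, p, sizeof a);
        return a;   // GCC and Clang typically fold this into a single unaligned vector load
    }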