c intrinsics avx512

AVX-512: _mm512_load vs. standard pointer casting?


In my testing the following code seems to execute fine:

double* ptr = _aligned_malloc(sizeof(double) * 8, 64);
__m512d* vect = (__m512d*)ptr;

However, AVX-512 provides an intrinsic that does the exact same thing: _mm512_load_pd. Is the code above considered dangerous in any way? I am assuming the only difference between the standard pointer casting and the intrinsic is that the intrinsic will immediately load the data onto a 64 byte register while the pointer casting will wait for a further instruction to do so. Am I correct?


Solution

  • I am assuming the only difference between the standard pointer casting and the intrinsic is that the intrinsic will immediately load the data onto a 64 byte register while the pointer casting will wait for a further instruction to do so.

    Nope, not at all. The two are exactly equivalent, and there's no difference in the generated asm. On most compilers, _mm512_load_pd is just a plain inline function that does something like return *(__m512d *) __P; (that's an exact copy-paste from GCC's headers). So a load intrinsic is literally already doing the cast-and-dereference for you.
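To see the equivalence for yourself, here is a minimal sketch that loads the same 64 bytes both ways and compares the results. The helper names are hypothetical, and it assumes GCC or Clang on x86-64: the target attribute and __builtin_cpu_supports are extensions of those compilers, used here only so the file builds without -mavx512f and skips the check on CPUs without AVX-512F. It uses an _Alignas(64) array instead of _aligned_malloc to stay portable.

```c
#include <immintrin.h>
#include <string.h>

/* Hypothetical helper: load the same 64 bytes with the intrinsic and with a
   plain cast + dereference, then compare the two vectors byte for byte.
   The target attribute (GCC/Clang extension) enables AVX-512F for this
   function only. The pointer must be 64-byte aligned for both forms. */
__attribute__((target("avx512f")))
static int loads_are_identical(const double *aligned64)
{
    __m512d via_intrinsic = _mm512_load_pd(aligned64);
    __m512d via_cast = *(const __m512d *)aligned64;
    return memcmp(&via_intrinsic, &via_cast, sizeof via_intrinsic) == 0;
}

/* Returns 1 if both loads match, 0 if they differ,
   and -1 if the CPU does not support AVX-512F (check skipped). */
int check_load_equivalence(void)
{
    _Alignas(64) double data[8] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};

    if (!__builtin_cpu_supports("avx512f"))
        return -1;
    return loads_are_identical(data);
}
```

Comparing the compiler's asm output for the two forms (for example with -O2 -S) is the more direct way to confirm there is no difference in code generation; this only checks that the values match.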

    __m512d is not fundamentally different from double or int in terms of how the compiler does register allocation and decides when to actually load C objects that happen to be in memory. The compiler can fold a load into a later ALU instruction as a memory operand (or optimize it away) regardless of how you write it. (And with AVX-512, it may be able to fold a _mm512_set1_pd(x) into a broadcast memory operand for instructions with a matching element width.)

    The _mm*_load[u]_* intrinsics may look like you're asking for a separate load instruction at that point, but that's not really what happens. That just makes your C look more like asm if you want it to.

    Just like memcpy between two int objects can be optimized away or done when it's convenient (as long as the result is as-if it were done in source order), so can store/load intrinsics depending on how you use them. And just like a + operator doesn't have to compile to an add instruction, _mm_add_ps doesn't necessarily have to compile to addps with those exact operands, or to addps at all.

    Load/store intrinsics basically exist to communicate alignment guarantees, or the lack of them, to the compiler (load/store promise alignment; loadu/storeu tell it not to assume any), and to take care of types for you (at least for ps and pd load[u]/store[u]; integer still requires casting the pointer). Also for AVX-512, to allow masked loads and masked stores.
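As a rough illustration of the cases where the intrinsics say something a plain dereference can't, here is a sketch that sums doubles from a pointer with no alignment guarantee, using _mm512_loadu_pd for full vectors and a masked load for the tail. The function names are hypothetical, and as an assumption it targets GCC or Clang on x86-64 (the target attribute and __builtin_cpu_supports are their extensions, used so it builds without -mavx512f).

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: sum n doubles starting at any (possibly misaligned)
   address. */
__attribute__((target("avx512f")))
static double sum_unaligned(const double *p, size_t n)
{
    __m512d acc = _mm512_setzero_pd();
    size_t i = 0;

    /* Full 8-element vectors: loadu makes no alignment promise. */
    for (; i + 8 <= n; i += 8)
        acc = _mm512_add_pd(acc, _mm512_loadu_pd(p + i));

    /* Tail of 1..7 elements: lanes whose mask bit is 0 are zeroed
       and not read from memory. */
    if (i < n) {
        __mmask8 tail = (__mmask8)((1u << (n - i)) - 1u);
        acc = _mm512_add_pd(acc, _mm512_maskz_loadu_pd(tail, p + i));
    }

    /* Horizontal sum via a small scalar loop, to keep the sketch simple. */
    double lanes[8];
    _mm512_storeu_pd(lanes, acc);
    double total = 0.0;
    for (int k = 0; k < 8; ++k)
        total += lanes[k];
    return total;
}

/* Sums the values 1..11 starting 8 bytes past a 64-byte boundary.
   Returns -1.0 if the CPU does not support AVX-512F (check skipped). */
double demo_sum(void)
{
    _Alignas(64) double buf[12];
    for (int i = 0; i < 12; ++i)
        buf[i] = (double)i;

    if (!__builtin_cpu_supports("avx512f"))
        return -1.0;
    return sum_unaligned(buf + 1, 11);
}
```

Dereferencing a (__m512d *) cast of buf + 1 instead would be the aligned form, so it would be wrong for this pointer.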

    Is the code above considered dangerous in any way?

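    No. Plain dereference is still strict-aliasing safe because the __m* vector types are special: GCC's headers define them with the may_alias attribute, so they're allowed to alias other types, much like char can. The pointer does have to be 64-byte aligned, same as for _mm512_load_pd, and the _aligned_malloc call in the question takes care of that. See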
    Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
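To make "special" a bit more concrete, here is a sketch of a stand-in type declared the way GCC's headers declare __m512i, using the GNU vector_size and may_alias attributes. The type and function names are hypothetical, and it assumes a compiler that supports those attributes (GCC, and Clang as far as I know). It needs no AVX-512 hardware, since the compiler lowers the 64-byte vector to whatever the target has.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for __m512i, modelled on GCC's definition:
   a 64-byte vector of long long that is allowed to alias other types. */
typedef long long vec8q __attribute__((vector_size(64), may_alias));

/* Reads an int32_t array through a vec8q pointer. Because of may_alias this
   does not violate strict aliasing, even though the element types differ.
   Returns 1 if the vector holds exactly the array's bytes. */
int read_ints_through_vector(void)
{
    _Alignas(64) int32_t data[16];
    for (int i = 0; i < 16; ++i)
        data[i] = i * 3;

    const vec8q *v = (const vec8q *)data;   /* same idea as (__m512i *)ptr */
    vec8q copy = *v;                        /* plain dereference */

    return memcmp(&copy, data, sizeof data) == 0;
}
```

Note this only goes one way: pointing an int32_t * at a vector object and reading through that is not covered by the attribute on the vector type.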