c++ x86 simd intrinsics avx2

What is the correct way to fill a __m128i parameter from a basic type (such as short) for use with the _mm256_broadcast*_epi* intrinsics (such as _mm256_broadcastw_epi16)?


All four of _mm256_broadcastb_epi8, _mm256_broadcastw_epi16, _mm256_broadcastd_epi32 and _mm256_broadcastq_epi64 are intrinsics for the VPBROADCASTB, VPBROADCASTW, VPBROADCASTD and VPBROADCASTQ instructions, respectively. According to Intel's documentation ("Intel® Advanced Vector Extensions Programming Reference"), those instructions may take an 8-bit, 16-bit, 32-bit or 64-bit memory location, respectively, as their source.
Page 5-230:

The source operand is 8-bit, 16-bit 32-bit, 64-bit memory location or the low 8-bit, 16-bit 32-bit, 64-bit data in an XMM register

However, the intrinsic API (in the Intel, MSVS and gcc headers) for those instructions takes a __m128i parameter. If I have a variable of a basic type, say short, what is the most efficient and cross-platform way (at least between MSVS and gcc) to pass that variable to the corresponding broadcast intrinsic (_mm256_broadcastw_epi16 in the case of short)?

For Example:

void func1(uint8_t v) {
    __m256i a = _mm256_broadcastb_epi8(<convert_to__m128i>(v));
    ...
}

void func1(uint16_t v) {
    __m256i a = _mm256_broadcastw_epi16(<convert_to__m128i>(v));
    ...
}

void func1(uint32_t v) {
    __m256i a = _mm256_broadcastd_epi32(<convert_to__m128i>(v));
    ...
}

void func1(uint64_t v) {
    __m256i a = _mm256_broadcastq_epi64(<convert_to__m128i>(v));
    ...
}

What should be the <convert_to__m128i> so it is most efficient and cross-platform (if possible)?

For MSVS for example one can do:

void func1(uint16_t v) {
    __m128i vt;
    vt.m128_u16[0] = v;
    __m256i a = _mm256_broadcastw_epi16(vt);
    ...
}

Without optimizations, however, the compiler may first load the value into an XMM register and only then use it in VPBROADCASTW, whereas with optimizations it may use the memory location of v directly. This approach is also valid only for MSVS.


Solution

  • There are already sequence/compound intrinsics which do exactly what you want:

    _mm256_set1_epi8, _mm256_set1_epi16, _mm256_set1_epi32, _mm256_set1_epi64x
    

    From Intel's Intrinsics Guide:

    Broadcast 8-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastb.

    With these, you should be able to trust the compiler to generate optimal code.

    I use the Intel Intrinsics Guide for this kind of thing. It is helpful because you can search in reverse from a mnemonic (in this case you knew you eventually wanted vpbroadcastb) and it will tell you which intrinsics are related to it.
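
As a rough illustration of the suggestion above, here is a minimal sketch of the four broadcasts written with the set1 intrinsics. The function names and the AVX2_FN macro are hypothetical, introduced only for this example. Each result is stored to memory so the lanes can be inspected. The last function is an alternative that is not part of the answer above: it builds the __m128i explicitly with _mm_cvtsi32_si128 and then calls the broadcast intrinsic. Whether either form ends up as a broadcast from memory or from a register is up to the compiler.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical helper macro: lets GCC/Clang compile these functions without
// -mavx2 on the command line. MSVC needs no per-function attribute.
#if defined(__GNUC__) || defined(__clang__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Each function broadcasts v to every lane of a 256-bit vector and stores
// the 32-byte result to out (which must have room for 32 bytes).
AVX2_FN void broadcast_u8(uint8_t v, uint8_t* out) {
    __m256i a = _mm256_set1_epi8(static_cast<char>(v));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), a);
}

AVX2_FN void broadcast_u16(uint16_t v, uint16_t* out) {
    __m256i a = _mm256_set1_epi16(static_cast<short>(v));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), a);
}

AVX2_FN void broadcast_u32(uint32_t v, uint32_t* out) {
    __m256i a = _mm256_set1_epi32(static_cast<int>(v));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), a);
}

AVX2_FN void broadcast_u64(uint64_t v, uint64_t* out) {
    // Note the "x" suffix: the 64-bit variant is _mm256_set1_epi64x.
    __m256i a = _mm256_set1_epi64x(static_cast<long long>(v));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), a);
}

// Alternative (an assumption, not from the answer above): put the scalar in
// the low bits of an XMM register, then broadcast its low 16 bits.
AVX2_FN void broadcast_u16_explicit(uint16_t v, uint16_t* out) {
    __m128i lo = _mm_cvtsi32_si128(v);
    __m256i a = _mm256_broadcastw_epi16(lo);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), a);
}
```

These functions still need a CPU with AVX2 at run time. With MSVC, or with -mavx2 passed to GCC/Clang, the macro can simply be dropped.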