In the following code, I can use AVX2 to count the number of 1 bits in each bit position separately, 16 bits at a time, but there are 4 missing instructions on the lines labelled loadLow16. I need an instruction that loads a 16-bit value and puts it in each of the 16-bit elements of the AVX2 register (16 times). Is there an instruction to do this, or is there a better way to do this?
void countHistBits4(const uint64_t p[], uint32_t n, uint32_t hist[64]) {
    uint16_t masks[16] = {1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1<<7,
                          1<<8, 1<<9, 1<<10, 1<<11, 1<<12, 1<<13, 1<<14, 1<<15};
    __m256i mask = _mm256_loadu_si256((const __m256i*)masks);
    __m256i count1 = _mm256_setzero_si256();
    __m256i count2 = _mm256_setzero_si256();
    __m256i count3 = _mm256_setzero_si256();
    __m256i count4 = _mm256_setzero_si256();
    for (uint32_t i = 0; i < n; i++) {
        __m256i v1 = loadLow16(p[i] & 0xFFFF);          // missing: broadcast 16-bit value
        __m256i v2 = loadLow16((p[i] >> 16) & 0xFFFF);  // missing
        __m256i v3 = loadLow16((p[i] >> 32) & 0xFFFF);  // missing
        __m256i v4 = loadLow16((p[i] >> 48) & 0xFFFF);  // missing
        v1 = _mm256_and_si256(v1, mask);
        count1 = _mm256_adds_epi16(count1, v1);
        v2 = _mm256_and_si256(v2, mask);
        count2 = _mm256_adds_epi16(count2, v2);
        v3 = _mm256_and_si256(v3, mask);
        count3 = _mm256_adds_epi16(count3, v3);
        v4 = _mm256_and_si256(v4, mask);
        count4 = _mm256_adds_epi16(count4, v4);
    }
    // TODO: reduce the counts into hist[64] (not written yet)
}
For your overall positional-popcount problem, see https://github.com/mklarqvist/positional-popcount for heavily optimized implementations, which are also correct, unlike this code, which you obviously haven't had time to debug yet since you were missing a building block. Adding multiple x & (1<<15) results into an int16_t element is going to saturate right away, so you'd need something else, perhaps a variable-count shift, or a compare like (x & mask) == mask so each element contributes 0 or 1. Or, probably better, a total redesign; see the related SO Q&As on positional popcount.
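As a rough illustration of that compare idea (a sketch of my own, not code from the question): (v & mask) == mask gives 0 or -1 per 16-bit element, and subtracting that from the counter adds exactly 1 per matching bit instead of adding the mask value itself.

#include <immintrin.h>

// Sketch only: accumulate 0/1 per bit position instead of adding mask values.
static inline __m256i add_bit_hits(__m256i count, __m256i v, __m256i mask) {
    // (v & mask) == mask  ->  0xFFFF (i.e. -1) in 16-bit lanes where the bit is set
    __m256i hits = _mm256_cmpeq_epi16(_mm256_and_si256(v, mask), mask);
    return _mm256_sub_epi16(count, hits);   // subtracting -1 increments by 1
}

(The 16-bit counters still wrap after 65535 increments, so they'd need to be widened or flushed periodically in a real loop.)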
The instruction is vpbroadcastw. It works with a memory or xmm source. On Intel CPUs, it decodes to a load and a shuffle (port 5) uop, unlike 32-, 64-, or 128-bit broadcasts, which are handled purely in the load port.
The intrinsics for it are:
- __m256i _mm256_set1_epi16( int16_t ) - if you only have a scalar.
- __m256i _mm256_broadcastw_epi16 (__m128i a) - to broadcast the bottom element of a vector.
To avoid violating the strict-aliasing rule in C, you're correct that accessing uint64_t p[] elements and masking them is a safe approach, while pointing a uint16_t * at it wouldn't be. (If you deref it normally, that is; unfortunately there's no load intrinsic that hides the deref inside an aliasing-safe intrinsic, so you'd have to memcpy into a uint16_t tmp var or something...)
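For example, a minimal aliasing-safe helper using memcpy could look like the sketch below (broadcast_word is a made-up name, not an intrinsic); compilers typically optimize a fixed-size 2-byte memcpy into a plain load:

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Sketch: copy one 16-bit chunk out of the uint64_t array into a temporary
// (aliasing-safe), then broadcast it to all 16 elements of a __m256i.
static inline __m256i broadcast_word(const uint64_t *p, size_t chunk) {
    uint16_t tmp;
    memcpy(&tmp, (const char *)p + 2 * chunk, sizeof(tmp));
    return _mm256_set1_epi16((short)tmp);
}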
Modern GCC is smart enough to compile __m256i v4 = _mm256_set1_epi16((p[i] >> 48) & 0xFFFF); into vpbroadcastw ymm0, WORD PTR [rdi+6+rdx*8], not doing anything stupid like an actual 64-bit scalar shift and then vmovd + an xmm-source broadcast (even with only -Og: https://godbolt.org/z/W6o5hKTbz).
But that's when only using one of the counts, with the others optimized away. (I just used a volatile __m256i sink to assign things to, as a way to stop the optimizer removing the loop entirely.)
https://godbolt.org/z/fzs9PEbMq shows that with heavier optimization, using count2 and count4 gets GCC to do a scalar load of the uint64_t and break it up with two separate scalar shifts before vmovd xmm0, edx / ... / vmovd xmm0, eax. So that's quite bad. :/
// compiles to a vpbroadcastw load with an offset
// but violates strict aliasing
__m256i v2 = _mm256_set1_epi16( *(1 + (uint16_t*)&p[i]) );
To make that safe, you could use memcpy into a temporary, or GNU C __attribute__((may_alias)). (The same attribute is used in the definition of __m256i itself.)
typedef uint16_t aliasing_u16 __attribute__((aligned(1), may_alias));
__m256i v1 = _mm256_set1_epi16(*(0 + (aliasing_u16*)&p[i]));
__m256i v2 = _mm256_set1_epi16(*(1 + (aliasing_u16*)&p[i]));
__m256i v3 = _mm256_set1_epi16(*(2 + (aliasing_u16*)&p[i]));
__m256i v4 = _mm256_set1_epi16(*(3 + (aliasing_u16*)&p[i]));
This compiles to 4x vpbroadcastw loads (https://godbolt.org/z/6v9esqK9P). (The instructions using those loads are elided below.)
vpbroadcastw ymm1, WORD PTR [rdi]
...
add rdi, 8
vpbroadcastw ymm1, WORD PTR [rdi-6]
...
vpbroadcastw ymm1, WORD PTR [rdi-4]
...
vpbroadcastw ymm1, WORD PTR [rdi-2]
...
This is probably better, to avoid bottlenecks on port 5 on Intel CPUs. Both vmovd xmm, eax and vpbroadcastw ymm, xmm are 1 uop that can only run on port 5 on Skylake-family CPUs (https://agner.org/optimize/, https://uops.info/).
vpbroadcastw with a memory source still needs a shuffle uop (p5), but getting the data from elsewhere into the SIMD domain uses a load port instead of another port 5 uop. And it can micro-fuse the load into a single front-end uop.
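Putting the pieces together, here's a sketch (my assumptions, not finished or tuned code) of what the inner loop could look like, combining the aliasing-safe broadcast loads with the compare idea from earlier so each counter goes up by at most 1 per input word:

#include <immintrin.h>
#include <stdint.h>

typedef uint16_t aliasing_u16 __attribute__((aligned(1), may_alias));

// Sketch only: process one uint64_t, updating four vectors of 16-bit counters.
// The counters are non-saturating here, so they need to be widened/flushed
// before 65536 input words; that bookkeeping is omitted.
static inline void count_word(const uint64_t *pi, __m256i mask, __m256i counts[4]) {
    for (int k = 0; k < 4; k++) {
        // broadcast the k-th 16-bit chunk of *pi to all 16 elements
        __m256i v = _mm256_set1_epi16(*(k + (const aliasing_u16 *)pi));
        // 0xFFFF (-1) in lanes where the selected bit is set, 0 elsewhere
        __m256i hits = _mm256_cmpeq_epi16(_mm256_and_si256(v, mask), mask);
        counts[k] = _mm256_sub_epi16(counts[k], hits);   // subtracting -1 adds 1
    }
}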