In the following code, I can use AVX2 to count the number of 1 bits in each bit position separately, 16 bits at a time, but there are 4 missing instructions on the lines labelled loadLow16. I need an instruction that loads a 16-bit value and puts it in each of the 16-bit elements of the AVX2 register (16 times). Is there an instruction to do this, or is there a better way to do this?
void countHistBits4(const uint64_t p[], uint32_t n, uint32_t hist[64]) {
    uint16_t masks[16] = {1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1<<7,
                          1<<8, 1<<9, 1<<10, 1<<11, 1<<12, 1<<13, 1<<14, 1<<15};
    __m256i mask = _mm256_loadu_si256((const __m256i*)masks);
    __m256i count1 = _mm256_setzero_si256();
    __m256i count2 = _mm256_setzero_si256();
    __m256i count3 = _mm256_setzero_si256();
    __m256i count4 = _mm256_setzero_si256();
    for (uint32_t i = 0; i < n; i++) {
        __m256i v1 = loadLow16(p[i] & 0xFFFF);          // missing: broadcast 16-bit value
        __m256i v2 = loadLow16((p[i] >> 16) & 0xFFFF);  // missing
        __m256i v3 = loadLow16((p[i] >> 32) & 0xFFFF);  // missing
        __m256i v4 = loadLow16((p[i] >> 48) & 0xFFFF);  // missing
        v1 = _mm256_and_si256(v1, mask);
        count1 = _mm256_adds_epi16(count1, v1);
        v2 = _mm256_and_si256(v2, mask);
        count2 = _mm256_adds_epi16(count2, v2);
        v3 = _mm256_and_si256(v3, mask);
        count3 = _mm256_adds_epi16(count3, v3);
        v4 = _mm256_and_si256(v4, mask);
        count4 = _mm256_adds_epi16(count4, v4);
    }
    // TODO: reduce the counts into hist[64] (not written yet)
}
For your overall positional-popcount problem, see https://github.com/mklarqvist/positional-popcount for heavily optimized implementations, which are also correct, unlike this code, which you obviously haven't had time to debug yet since you were missing a building block. Adding multiple x & (1<<15) results into an int16_t element is going to saturate right away, so you'd need something else, perhaps a variable-count shift, or a compare like (x & mask) == mask so each element contributes 0 or 1. Or, probably better, a total redesign; see the related SO Q&As on positional popcount.
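As a rough illustration of that compare idea (a sketch of my own, not code from the question): (v & mask) == mask gives 0 or -1 per 16-bit element, and subtracting that from the counter adds exactly 1 per matching bit instead of adding the mask value itself.

#include <immintrin.h>

// Sketch only: accumulate 0/1 per bit position instead of adding mask values.
static inline __m256i add_bit_hits(__m256i count, __m256i v, __m256i mask) {
    // (v & mask) == mask  ->  0xFFFF (i.e. -1) in 16-bit lanes where the bit is set
    __m256i hits = _mm256_cmpeq_epi16(_mm256_and_si256(v, mask), mask);
    return _mm256_sub_epi16(count, hits);   // subtracting -1 increments by 1
}

(The 16-bit counters still wrap after 65535 increments, so they'd need to be widened or flushed periodically in a real loop.)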
The instruction is vpbroadcastw. It works with a memory or xmm source. On Intel CPUs, it decodes to a load and a shuffle (port 5) uop, unlike 32-, 64-, or 128-bit broadcasts, which are handled purely in the load port.
The intrinsics for it are:
- __m256i _mm256_set1_epi16( int16_t ) - if you only have a scalar.
- __m256i _mm256_broadcastw_epi16 (__m128i a) - to broadcast the bottom element of a vector.
To avoid violating the strict-aliasing rule in C, you're correct that accessing uint64_t p[] elements and masking them is a safe approach, while pointing a uint16_t * at it wouldn't be. (If you deref it normally, that is; unfortunately there's no load intrinsic that hides the deref inside an aliasing-safe intrinsic, so you'd have to memcpy into a uint16_t tmp var or something...)
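For example, a minimal aliasing-safe helper using memcpy could look like the sketch below (broadcast_word is a made-up name, not an intrinsic); compilers typically optimize a fixed-size 2-byte memcpy into a plain load:

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Sketch: copy one 16-bit chunk out of the uint64_t array into a temporary
// (aliasing-safe), then broadcast it to all 16 elements of a __m256i.
static inline __m256i broadcast_word(const uint64_t *p, size_t chunk) {
    uint16_t tmp;
    memcpy(&tmp, (const char *)p + 2 * chunk, sizeof(tmp));
    return _mm256_set1_epi16((short)tmp);
}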
Modern GCC is smart enough to compile __m256i v4 = _mm256_set1_epi16((p[i] >> 48) & 0xFFFF); into vpbroadcastw ymm0, WORD PTR [rdi+6+rdx*8], not doing anything stupid like an actual 64-bit scalar shift and then vmovd + an xmm-source broadcast (even with only -Og: https://godbolt.org/z/W6o5hKTbz).
But that's when only using one of the counts, with the others optimized away. (I just used a volatile __m256i sink to assign things to, as a way to stop the optimizer removing the loop entirely.)
https://godbolt.org/z/fzs9PEbMq shows that with heavier optimization, using count2 and count4 gets GCC to do a scalar load of the uint64_t and break it up with two separate scalar shifts before vmovd xmm0, edx / ... / vmovd xmm0, eax. So that's quite bad. :/
// compiles to a vpbroadcastw load with an offset
// but violates strict aliasing
__m256i v2 = _mm256_set1_epi16( *(1 + (uint16_t*)&p[i]) );
To make that safe, you could use memcpy into a temporary, or GNU C __attribute__((may_alias)). (The same attribute is used in the definition of __m256i itself.)
typedef uint16_t aliasing_u16 __attribute__((aligned(1), may_alias));
__m256i v1 = _mm256_set1_epi16(*(0 + (aliasing_u16*)&p[i]));
__m256i v2 = _mm256_set1_epi16(*(1 + (aliasing_u16*)&p[i]));
__m256i v3 = _mm256_set1_epi16(*(2 + (aliasing_u16*)&p[i]));
__m256i v4 = _mm256_set1_epi16(*(3 + (aliasing_u16*)&p[i]));
This compiles to 4x vpbroadcastw loads (https://godbolt.org/z/6v9esqK9P). (The instructions using those loads are elided below.)
vpbroadcastw ymm1, WORD PTR [rdi]
...
add rdi, 8
vpbroadcastw ymm1, WORD PTR [rdi-6]
...
vpbroadcastw ymm1, WORD PTR [rdi-4]
...
vpbroadcastw ymm1, WORD PTR [rdi-2]
...
This is probably better, to avoid bottlenecks on port 5 on Intel CPUs. Both vmovd xmm, eax and vpbroadcastw ymm, xmm are 1 uop that can only run on port 5 on Skylake-family CPUs (https://agner.org/optimize/, https://uops.info/).
vpbroadcastw with a memory source still needs a shuffle uop (p5), but getting the data from elsewhere into the SIMD domain uses a load port instead of another port 5 uop. And it can micro-fuse the load into a single front-end uop.
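Putting the pieces together, here's a sketch (my assumptions, not finished or tuned code) of what the inner loop could look like, combining the aliasing-safe broadcast loads with the compare idea from earlier so each counter goes up by at most 1 per input word:

#include <immintrin.h>
#include <stdint.h>

typedef uint16_t aliasing_u16 __attribute__((aligned(1), may_alias));

// Sketch only: process one uint64_t, updating four vectors of 16-bit counters.
// The counters are non-saturating here, so they need to be widened/flushed
// before 65536 input words; that bookkeeping is omitted.
static inline void count_word(const uint64_t *pi, __m256i mask, __m256i counts[4]) {
    for (int k = 0; k < 4; k++) {
        // broadcast the k-th 16-bit chunk of *pi to all 16 elements
        __m256i v = _mm256_set1_epi16(*(k + (const aliasing_u16 *)pi));
        // 0xFFFF (-1) in lanes where the selected bit is set, 0 elsewhere
        __m256i hits = _mm256_cmpeq_epi16(_mm256_and_si256(v, mask), mask);
        counts[k] = _mm256_sub_epi16(counts[k], hits);   // subtracting -1 adds 1
    }
}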