I am looking for fast code to perform the following operations on __m256i and would appreciate help:

- set / clear / flip bit n, given int i (where 0 <= i < 256)
- find lowest set bit, returning int (can assume non-zero input)
- find highest set bit, returning int (can assume non-zero input)
- clear lowest set bit
- clear highest set bit

Find, and clear/set/flip-nth, are already solved in earlier Q&As.
Clear lowest set can be done a bit more efficiently than clear_nth(vec, ctz(vec))
. I think it's the only one that has an interesting answer but isn't already answered in other Q&As.
There isn't a bithack for clear highest set, since add/sub carry propagates from low to high, and x86 doesn't have a cheap bit-reverse even for scalar. AVX-512 has vplzcntq
which could allow a bit-clear with vpsrlvq
(0x800...>> n) / vpandn
, but with AVX2 the least-bad option might be clear_nth(vec, 255-clz(vec))
. A good clear_nth already uses vpsllvq
or vpsllvd
.
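
(As an untested sketch of that vplzcntq / vpsrlvq / vpandn idea, assuming AVX-512CD + AVX-512VL, with a function name I made up: this clears the highest set bit within each qword. To clear only the overall highest set bit of a whole __m256i you'd still need to restrict it to the highest non-zero element, e.g. with merge-masking.)

#include <immintrin.h>

// untested sketch: clear the highest set bit within each 64-bit element (AVX-512CD + VL)
// zero elements stay zero: lzcnt = 64 makes vpsrlvq produce 0
static inline __m256i clear_msb_per_qword(__m256i v)
{
    __m256i lz  = _mm256_lzcnt_epi64(v);                                   // vplzcntq
    __m256i top = _mm256_set1_epi64x((long long)0x8000000000000000ULL);
    __m256i bit = _mm256_srlv_epi64(top, lz);                              // 0x800...0 >> lzcnt
    return _mm256_andnot_si256(bit, v);                                    // vpandn: v & ~bit
}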
Efficiently find least significant set bit in a large array? - finds the lowest set bit in a non-zero __m256i
after looping to find a non-zero __m256i
. With the opposite direction of bit-scan instructions, the same algorithm can find the highest set bit. (Or I guess 31 - std::countl_zero(mask)
since bsr
is slow on AMD.)
set/clear/flip a bit at a runtime-variable index: set individual bit in AVX register (__m256i), need "random access" operator - @wim's answer using vpsllvd
is quite clever, generating a vector with exactly 1 bit set. SIMD shifts zero the element when the shift count is out-of-range, so set1(count) - setr(0, 32, 64, 96, ...)
produces a vector with only one in-range shift count, with which you can shift a vector of set1(1)
.
Clearing instead of setting the selected bit is just a matter of _mm256_andnot_si256
instead of or
with the 1<<n
vector. Or xor
to flip that bit.
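
As a concrete sketch of that whole trick (untested; the helper names are mine, assuming AVX2):

#include <immintrin.h>

// untested sketch: a __m256i with exactly one bit set at position i (0 <= i < 256)
static inline __m256i one_bit_256(unsigned i)
{
    __m256i base   = _mm256_setr_epi32(0, 32, 64, 96, 128, 160, 192, 224);
    __m256i counts = _mm256_sub_epi32(_mm256_set1_epi32((int)i), base);
    return _mm256_sllv_epi32(_mm256_set1_epi32(1), counts);   // vpsllvd: out-of-range counts give 0
}

static inline __m256i set_bit256(__m256i v, unsigned i)   { return _mm256_or_si256(v, one_bit_256(i)); }
static inline __m256i clear_bit256(__m256i v, unsigned i) { return _mm256_andnot_si256(one_bit_256(i), v); }
static inline __m256i flip_bit256(__m256i v, unsigned i)  { return _mm256_xor_si256(v, one_bit_256(i)); }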
(ermlg's answer has a store/reload store-forwarding stall, but might not be bad for throughput if there's a lot of surrounding code between executions of this, since store-forwarding stalls can't pipeline with each other on SnB-family, but can pipeline with successful store-forwarding.)
clear_nth(v, bitscan_forward/reverse(v))
using the above strategies might be the least-bad way to clear the highest or lowest set bit across a whole __m256i
. I don't think there's a good way to do this, so ideally, design your algorithm to only need this operation in smaller chunks.
With AVX-512 you could test-into-mask (vptestmd k1, ymm0,ymm0
) and kmov
/blsi
/kmov
to isolate the lowest set bit for a merge-masked v &= v-1
(vpaddd
/ vpandd
). So only the element containing the lowest set bit would change, since only that element's mask bit is set. (There is no ksub
or kneg
to implement mask &= -mask
in two k
instructions; you'd have to use a 2's-complement identity like m &= ~m+1
with knotw
/kaddw
/kandw
with a __mmask16 k1 = 1;
mask constant already set up. Since each k
instruction runs on only one port on Intel, but BMI1 blsi
can even run on port 1 or 5, it's probably better to kmov
around it. And compilers will probably do that regardless of your wishes.)
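
Something like this untested sketch (assumes AVX-512F + AVX-512VL and BMI1; names are mine). The casts to and from scalar are where the kmov round trip happens:

#include <immintrin.h>

// untested sketch: clear the lowest set bit of the whole vector, AVX-512 style
static inline __m256i vblsr256_avx512(__m256i v)
{
    __mmask8 nz  = _mm256_test_epi32_mask(v, v);          // vptestmd: 1 where the dword is non-zero
    __mmask8 low = (__mmask8)_blsi_u32((unsigned)nz);     // kmov out, blsi, kmov back in
    __m256i  vm1 = _mm256_add_epi32(v, _mm256_set1_epi32(-1));   // v-1 in every element
    return _mm256_mask_and_epi32(v, low, v, vm1);         // merge-masked vpandd: only one element changes
}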
__m256i
doesn't work like a single 256-bit integer. It's very inconvenient to try to propagate carry across SIMD elements for anything like blsr
with an x &= x-1
bithack; you don't want to try to actually do that.
The search functions are pretty doable, though, using _mm256_cmpeq_epi8
against zero and _mm256_movemask_epi8
/ tzcnt
or 31-lzcnt
(or bsr
on Intel where it's not slow), in parallel with storing the vector to a tmp buffer so you can index the relevant byte and bit-scan it. Or in wider chunks with _mm256_movemask_ps
on _mm256_cmpeq_epi32
compare masks. See the linked Q&A above.
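
For example (untested sketch, assumes AVX2 + BMI1; the function name is mine), finding the position of the lowest set bit of a non-zero __m256i:

#include <immintrin.h>
#include <stdint.h>

// untested sketch: index of the lowest set bit in a non-zero __m256i
static inline unsigned lowest_set_bit_256(__m256i v)
{
    __m256i zcmp = _mm256_cmpeq_epi8(v, _mm256_setzero_si256());
    uint32_t nonzero_bytes = ~(uint32_t)_mm256_movemask_epi8(zcmp);   // 1 = that byte is non-zero
    unsigned byte_idx = (unsigned)_tzcnt_u32(nonzero_bytes);          // caller guarantees v != 0

    uint8_t buf[32];
    _mm256_storeu_si256((__m256i*)buf, v);              // store/reload to index the relevant byte
    return byte_idx * 8 + (unsigned)_tzcnt_u32(buf[byte_idx]);
}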
Also perhaps related building blocks:
Trying to write a vectorized implementation of Gerd Isenberg's Bit Scan Forward as an exercise is tzcnt
aka bsf
on each 16-bit element in parallel (using a De Bruijn sequence).
Doing that in parallel with finding the lowest or highest element with a set bit could perhaps reduce critical-path latency vs. store/reload, then bit-scanning the element, then adding. But it would probably be worse for throughput.
Is there an efficient way to get the first non-zero element in an SIMD register using SIMD intrinsics? is just finding which element, not which bit inside it, so it's just a sub-problem of Efficiently find least significant set bit in a large array?
The AVX-512 strategy I suggested above, of merge-masked v &= v-1
, could work with AVX2 with _mm256_blendv_epi8
instead of merge-masking.
With just 4x qword elements per vector, there are non-terrible ways to generate the mask vector we want from a compare mask, e.g. starting with a 4-bit bitmap of the zero elements from _mm256_movemask_pd( _mm256_castsi256_pd(cmp))
. See is there an inverse instruction to the movemask instruction in intel avx2? - a LUT is an option, especially compressing it to byte elements loaded with vpmovsxbq
to expand int8 -1
/0xFF
to int64, e.g. with _mm_loadu_si32
which might or might not fold into a memory source operand for vpmovsxbq
. (While we have the mask as a scalar, clear its lowest set bit with blsr
, or if using a LUT, make the LUT elements have that already done.)
Or we could consider just doing vector shuffles and OR or ANDNOT to generate a mask that's all-set or all-zero above the lowest non-zero element, as a control for vpblendvb
. The mask in elements that were originally zero doesn't matter, since x &= x-1
stays zero, but we need to keep the original element for non-zero elements other than the lowest. I think there have been some earlier Q&As about this.
Or better, skip the blend and use that mask as part of the bithack. The x & x-1
bithack involves adding a constant -1
, but if we instead added -1
or 0
, the elements where we added zero wouldn't change. x & (x+0)
is just v
. So we need a mask that has a -1
only in the lowest non-zero element (don't-care in any zero elements below that).
On CPUs where vpblendvb
is only 1 uop (Zen but not Intel), this no-blend version would have critical-path latency longer by 1 cycle, but still better throughput. (The critical path from the mask being ready to the final result becomes vpaddq / vpand instead of just a vpblendvb, with the bithack running in parallel.) Unless it costs extra bitwise ops to get a mask that's -1
where we want instead of 0
where we want, since x86 before AVX-512 only has SIMD-integer ==
and signed >
, not directly !=
. So we're stuck with getting a 0
in elements that are non-zero, and will need to introduce a -1
somehow into the lowest element.
#include <immintrin.h>

// This naming choice follows BMI1 blsr (Bit Lowest-Set Reset) but for vectors.
__m256i vlsr256(__m256i v)
{
    // find the elements that are all-zero
    __m256i zcmp = _mm256_cmpeq_epi64(v, _mm256_setzero_si256());
    //unsigned zmask = _mm256_movemask_pd(_mm256_castsi256_pd(zcmp));
    // 4-bit mask, 1 where there are zeros

    // get a mask that's zero above the lowest non-zero element
    /* option 1: use zmask to index a LUT of __m256i lut[16]; (512 bytes)
     * option 2: use zmask to index int32_t lut[16] with vpmovsxbq (64 bytes)
     * option 3: countr_zero(~zmask) to load a window of mask bytes or qwords from i8[] = {-1,-1,-1,0,0,0,0}; (7 or 56 bytes)
     * option 4: vector-compare for > against _mm256_set1_epi64x(tzcnt(~zmask))
     * option 5: shuffle zcmp instead of using movemask at all
     */

    // This is option 5, shuffling.  3 shuffles and 2 bitwise ops, but with some ILP.
    // And the only lane-crossing shuffle is vinserti128, so cheap even on Zen 1.
    // zcmp = [ D C B A ].  -1 means that v element was zero, 0 means non-zero.
    // High u64 element on the left, in the direction of left shifts.
    // desired output: -1 if this element should update, 0 to keep
    //    C&B&A    B&A    A    -1
    __m128i zcmp_low = _mm256_castsi256_si128(zcmp);                    // [ B A ]
    __m128i ab = _mm_shuffle_epi32(zcmp_low, _MM_SHUFFLE(1,0, 3,2));    // [ A B ]
    ab = _mm_and_si128(ab, zcmp_low);                                   // [ A&B A&B ]
    __m256i ca = _mm256_unpacklo_epi64(_mm256_set1_epi8(-1), zcmp);     // [ C -1 | A -1 ]
    __m256i insert = _mm256_inserti128_si256(ca, ab, 1);                // [ A&B A&B | A -1 ]
    __m256i lowmask = _mm256_and_si256(ca, insert);                     // [ C&B&A A&B | A -1 ]

    // clear the lowest set bit in the selected element with x & (x + -1)
    // leave others unmodified with x & (x+0)
    __m256i vm1 = _mm256_add_epi64(v, lowmask);
    return _mm256_and_si256(v, vm1);

#if 0   // the blend version: per-element blsr, then keep it only in the selected element
    __m256i vm1 = _mm256_add_epi64(v, _mm256_set1_epi64x(-1));
    __m256i blsr_each = _mm256_and_si256(v, vm1);
    return _mm256_blendv_epi8(v, blsr_each, blendmask);   // blendmask from one of the options above
    // or reverse the first two operands to blendv if an inverted blendmask is easier to generate
#endif
}
Tested on Godbolt via constant-propagation in clang and looking at the asm comments for the vector return value. Also compiles to sane-looking asm with both GCC and clang.
vlsr256(long long __vector(4)):
        vpxor        xmm3, xmm3, xmm3      # setzero()
        vpcmpeqd     ymm2, ymm2, ymm2      # set1(-1); both will be hoisted out of loops
        vpcmpeqq     ymm3, ymm0, ymm3
        vpshufd      xmm1, xmm3, 78
        vpunpcklqdq  ymm2, ymm2, ymm3
        vpand        xmm1, xmm1, xmm3
        vinserti128  ymm1, ymm2, xmm1, 0x1
        vpand        ymm1, ymm1, ymm2
        vpaddq       ymm1, ymm1, ymm0
        vpand        ymm0, ymm1, ymm0
        ret
Using https://uica.uops.info/ to count uops and analyze dependencies (not including the ret
or materializing the constants), the critical path latency is only 9 cycles on Intel. (Probably better on AMD where vinserti128
is only 1 cycle, even on Zen 2 and later where vector ALUs and the register file are 256-bit wide).
And it's 8 single-uop instructions for the front-end.
On Skylake, 3 of them (all the shuffles) need port 5 so that's a slightly worse bottleneck than just 8 uops for 3 vector ALU ports. But on Ice Lake and later, 2 of the shuffles can run on p1/p5 so even without surrounding code, back-end port pressure is pretty even.
movemask
/ LUT might be competitive on total uops, and some of them could probably run on port 6 (or on AMD, on the separate scalar ports), not competing with surrounding vector work. If latency matters, a LUT might be close if it hits in cache. This shuffling version doesn't need to load any vector constants from memory; compilers know how to materialize 0
and -1
with a single ALU instruction each.
[ D C B A ]   // input, high element at the left, so left shifts go left
With zcmp input, A=-1 means original v[0] == 0
desired output: -1 if this element should update, 0 to keep
C&B&A B&A A -1 or ~A
| B&A B&A // vpshufd + vpand xmm
C -1 | A -1 // vshufps/pd orig with a -1 constant to get both. vpalignr run on fewer ports on Intel
B&A B&A | A -1 // vinserti128 (cheap even on Zen 1)
C&B&A B&A | A -1 // vpand
Two shuffle / AND steps would be enough for an inclusive prefix-AND, but we need the lowest non-zero element to end up with a value different from the one the compare produced there (an exclusive scan), so this takes a 3rd shuffle. I didn't see a way to make the extra instruction a bitwise NOT (xor with -1) instead of a shuffle, and on modern CPUs in-lane shuffles are not a serious problem if you pick the ones that are cheap (single uop and 1c latency) on both AMD and Intel. I'd been expecting I'd end up using vpermq at least once, which would have sucked on AMD, especially Zen 1; it's still 2 uops on AMD before Zen 4. But that was when I was still hoping to get away with only 2 shuffles.
I hadn't initially realized my vshufpd was just taking the low element for each choice, and that that's equivalent to vunpcklpd
. GCC spotted that "optimization" (although it runs on fewer ports on Intel Ice Lake and later. But smaller code size.)
But fortunately there's an equivalent integer shuffle, vpunpcklqdq
which runs fast everywhere (1 cycle latency on AMD even before Zen 4, where the 256-bit FP shuffles were 3c latency although still 1 uop with 0.5c throughput)
Before I realized that, I wrote it this way :P
    __m256i ca = _mm256_castpd_si256(       // [ C -1 | A -1 ]  vshufpd with casts to keep compilers happy.  vpalignr runs on fewer ports on Intel
        _mm256_shuffle_pd(
            _mm256_castsi256_pd(_mm256_set1_epi8(-1)),
            _mm256_castsi256_pd(zcmp),
            0b0000));                       // no dependency on ab yet
Earlier ideas that didn't pan out, and brainstorming.
D D C C | B B A A // as 32-bit elements
D C C 0 | B A A 0 // vpslldq zcmp, 4
D D&C C 0 | B B&A A 0 // vpand ymm. (or XMM to zero high half?)
C 0 | A 0 // vpshufd or vpslldq orig input, 8 // either is port 1 on ICL/ADL.
C -1 | A -1 // vpor constant, or perhaps vpcmpeqd against something. Or vshufpd with a -1 constant
// possible paddq or psubq with partially-overlapping qwords to OR the MSB of one 32-bit half,
// and/or carry-propagate into the top half where we get another useful result from the same op?
// That's good if we only want blend controls, so don't care about the low bits of each byte, and can afford to shift in some zeros at the bottom if doing -1 + -1 = -2
// Not sure there's any useful combinations.
// initial exploration
D C B A // zcmp input, A=-1 means original v[0] == 0
C 0 A 0 // vpslldq 8
DC C BA A // vpor. Nope, doesn't make the low non-zero element different from the higher ones.
D C B A
C B A A // vpermq
D&~C C&~B B&~A A&~A=0 // vpandn