Given a `__m128i` which stores 16 `char`s, "even-index lanes" are the lanes at even indices (i.e., 0, 2, 4, ..., 14), and "odd-index lanes" are the lanes at odd indices (i.e., 1, 3, 5, ..., 15).
In my application, each even/odd lane must be in a given range. For example, suppose `even_min` is 1, `even_max` is 7, `odd_min` is 5, and `odd_max` is 10:
```
# valid
vec1: [1, 5, 6, 10, 2, 6, 4, 6, 2, 7, 4, 9, 2, 7, 4, 8]
# invalid because the 0-th lane (even) is greater than even_max
vec2: [8, 5, 6, 10, 2, 6, 4, 6, 2, 7, 4, 9, 2, 7, 4, 8]
```
How can I check whether a vector is valid more efficiently?

My current solution is straightforward: check the two comparison results separately.
```c
__m128i even_min = _mm_set1_epi8(xxx);
__m128i even_max = _mm_set1_epi8(xxx);
__m128i even_mask =
    _mm_set_epi8(0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1);
__m128i evenRange = _mm_and_si128(_mm_cmpge_epi8(vec, even_min),
                                  _mm_cmple_epi8(vec, even_max));
bool isEvenOk = _mm_testc_si128(evenRange, even_mask);
// the code for checking odd bytes is similar
```
Note that to compare unsigned chars with inclusive conditions, two macros are defined as follows:

```c
#define _mm_cmpge_epi8(a, b) _mm_cmpeq_epi8(_mm_max_epu8(a, b), a)
#define _mm_cmple_epi8(a, b) _mm_cmpge_epi8(b, a)
```
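As a sanity check on that macro trick, here is a hedged scalar model (plain C, no intrinsics; the names `cmpge_u8_via_max` and `verify_cmpge_model` are invented for this sketch) that brute-forces all byte pairs to confirm that per-lane `max_epu8(a, b) == a` is exactly unsigned `a >= b`:

```c
#include <stdint.h>

/* Scalar model of one lane of _mm_cmpge_epi8(a, b) as defined above:
 * pmaxub then pcmpeqb.  max(a, b) == a holds iff a >= b (unsigned). */
static int cmpge_u8_via_max(uint8_t a, uint8_t b) {
    uint8_t m = (a > b) ? a : b;   /* unsigned byte max, like pmaxub */
    return m == a;                 /* the SIMD lane would be 0xFF or 0x00 */
}

/* Brute-force all 65536 byte pairs; returns 1 if the identity holds. */
static int verify_cmpge_model(void) {
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (cmpge_u8_via_max((uint8_t)a, (uint8_t)b) != (a >= b))
                return 0;
    return 1;
}
```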
Make one vector of per-lane `min` values, and another of per-lane `max`es, e.g. `_mm_set1_epi16((odd_min<<8) | (uint8_t)even_min)`. (Note the cast to avoid sign-extension.)

Then you only need one range-check, which you should do more efficiently than by emulating `cmpge` and `cmple` with 2 instructions each. A simple way, as suggested by Andrey in comments, is `v == min(max(v, a), b)`, which is the same idea as your `v == max(v, a)`.
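A scalar sketch of that clamp idea (the name `in_range_clamp` is hypothetical; the SIMD version would be one `pmaxub`, one `pminub`, and one `pcmpeqb` per vector):

```c
#include <stdint.h>

/* Scalar model of one lane of v == min(max(v, lo), hi).  The clamp
 * leaves v unchanged exactly when lo <= v <= hi (unsigned, inclusive). */
static int in_range_clamp(uint8_t v, uint8_t lo, uint8_t hi) {
    uint8_t clamped = (v < lo) ? lo : v;           /* pmaxub(v, lo)   */
    clamped = (clamped > hi) ? hi : clamped;       /* pminub(.., hi)  */
    return clamped == v;                           /* pcmpeqb         */
}
```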
Since you're using the same mins/maxes for many inputs, some extra setup work is worth it to make each range-check cheaper. The usual range-check trick of `c - min < max - min` uses an unsigned compare, but we can do it with SSE signed compares by flipping the MSB of both sides, i.e. adding or subtracting 0x80. This is like range-shifting unsigned to signed, and the 0x80 can be folded into the same subtraction: `c - min - 0x80 < max - min - 0x80` (signed compare). (Thanks, @amonakov, for the reminder that this was possible.)
```c
// unsigned compare-trick, range-shifted for use with pcmpgtb

// loop-invariant constants, set these up once
__m128i mins = _mm_set1_epi16( ((odd_min<<8) | (uint8_t)even_min) ^ 0x8080 );
// if (odd_max == 0x7F && even_max == 0x7F) { ... }  // TODO: just check vec > mins
__m128i maxes = _mm_set1_epi16( (odd_max<<8) | (uint8_t)even_max );
__m128i rangelen = _mm_sub_epi8(maxes, mins);  // includes the 0x80 top bit from mins
// compilers will constant-propagate through this, except maybe MSVC.
// If that's a problem, write it a different way.

// work inside the loop
__m128i vsub = _mm_sub_epi8(vec, mins);
__m128i vout_of_range = _mm_cmpgt_epi8(vsub, rangelen);
// TODO: check for off-by-one errors in case I got this wrong, or inclusive vs. exclusive.
// consider mins = 0^0x80, maxes = 1, rangelen = 1^0x80 = -127.
// vec = 2: vsub = 2^0x80 = -126.  -126 > -127 so it's out-of-range
// (by 1; this range is inclusive at the top: vec = 1 gives vsub = -127, not greater).
bool isOk = !_mm_movemask_epi8(vout_of_range);  // ok if no bits set
```
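To settle that TODO, here is a hedged scalar model of one lane (the names `in_range_shifted` and `verify_range_trick` are invented for this sketch), brute-forced over all inputs; it agrees with an inclusive `min <= v <= max`:

```c
#include <stdint.h>

/* Scalar model of one lane of the compare-trick above.
 * min/max are the unsigned bounds for this lane, with min <= max. */
static int in_range_shifted(uint8_t v, uint8_t min, uint8_t max) {
    uint8_t mins     = min ^ 0x80;             /* range-shift, like ^0x8080  */
    uint8_t rangelen = (uint8_t)(max - mins);  /* includes mins' 0x80 bit    */
    uint8_t vsub     = (uint8_t)(v - mins);    /* psubb                      */
    return !((int8_t)vsub > (int8_t)rangelen); /* pcmpgtb, then movemask==0  */
}

/* Brute force: the trick must agree with min <= v && v <= max (inclusive). */
static int verify_range_trick(uint8_t min, uint8_t max) {
    for (int v = 0; v < 256; v++)
        if (in_range_shifted((uint8_t)v, min, max) != (v >= min && v <= max))
            return 0;
    return 1;
}
```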
@chtz suggests one `paddb` + `paddusb` + `pmovmskb` is possible if the size of your range is less than 128. In-range values won't have the MSB set in each byte, but out-of-range values will end up at 128 or above (and can't wrap around because of saturation). `pmovmskb` grabs the MSB of each byte, so it works without needing a compare result. `psubb` / `pcmpgtb` should be equally good on most CPUs. (Checking for `!= 0` is as cheap as `== 0` for the bitmap result.)
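A scalar sketch of that saturating-add idea (the names `msb_after_sat_add` and `verify_sat_trick` are invented; it assumes `max - min < 128`): subtract `min` with wrapping, then saturating-add `127 - (max - min)` so exactly the out-of-range values reach the MSB:

```c
#include <stdint.h>

/* Scalar model of one lane of the paddb + paddusb idea.
 * Requires max - min < 128 so in-range sums stay below the MSB. */
static int msb_after_sat_add(uint8_t v, uint8_t min, uint8_t max) {
    uint8_t len  = (uint8_t)(max - min);           /* range length, < 128   */
    uint8_t sub  = (uint8_t)(v - min);             /* paddb with set1(-min) */
    unsigned sum = sub + (unsigned)(127 - len);
    uint8_t sat  = (sum > 255) ? 255 : (uint8_t)sum; /* paddusb saturates   */
    return (sat & 0x80) != 0;                      /* the bit pmovmskb reads */
}

/* Brute force: MSB set exactly for values outside [min, max]. */
static int verify_sat_trick(uint8_t min, uint8_t max) {
    for (int v = 0; v < 256; v++)
        if (msb_after_sat_add((uint8_t)v, min, max) != !(v >= min && v <= max))
            return 0;
    return 1;
}
```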
**`min`/`max`/`cmpeq`**
Other possibilities include `(v < mins) | (v > maxes)` and checking that no elements are true: `_mm_movemask_epi8(or_result) == 0`. This has better critical-path latency than min/max/cmpeq, since we have two independent compares instead of a chain of 3 operations. Both ways need a copy of the original vector (unless you compile with AVX to allow a separate destination).
Or `(v > min-1) & (v < max+1)`, which is viable for compile-time-constant min/max. If min is already INT8_MIN, the first condition is always true, so it optimizes away to just the other condition. Except that's a problem when even_min is -128 but odd_min is something else: there's no constant that will make `pcmpgtb` always true for the even lanes of every input while still checking the odd lanes. I had been thinking the AND could be done as part of a `ptest` (`_mm_test_*`), but actually there's no `_mm_test_all_ones` for this: ZF is cleared if there's any non-zero bit in the 128-bit AND result (and likewise CF, based on the ANDN result).
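For reference, a scalar model of those `ptest` flag semantics (folded from 128 to 64 bits for brevity; the `model_*` names are invented):

```c
#include <stdint.h>

/* Scalar model of the PTEST flags:
 * _mm_testz_si128(a, b) returns ZF = ((a & b)  == 0)   (AND result)
 * _mm_testc_si128(a, b) returns CF = ((~a & b) == 0)   (ANDN result) */
static int model_testz(uint64_t a, uint64_t b) { return (a & b) == 0; }
static int model_testc(uint64_t a, uint64_t b) { return (~a & b) == 0; }
```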
Or use `cmpgt` both times and invert one of the results as part of combining them, e.g. with `_mm_andnot_si128` (`pandn`).
`ptest` is not very efficient on compare results since it decodes to 2 uops on most CPUs; `pmovmskb` + scalar `cmp` or `test` is also 2 uops (https://uops.info), and the `cmp` or `test` can macro-fuse with a branch if you're branching on it. `ptest` does avoid needing a temporary register and might save a `movdqa` register-copy if you're testing a vector that you also want to use later (not a comparison result), but it's usually good only if you're actually using its ability to check only some elements (like with your odd/even masks).
In your case, even with your strategy of two separate compares, a probably-better approach would be 2x `_mm_movemask_epi8` and `(evens & (odds>>1) & 0x5555) == 0x5555`. (0x5555 is 0b0101...0101, testing just the even bit positions.)
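A scalar sketch of that mask-combining step (the name `both_checks_pass` is hypothetical), taking the two 16-bit `pmovmskb` results:

```c
#include <stdint.h>

/* Combine two pmovmskb bitmaps: `evens` from the even-lane range check,
 * `odds` from the odd-lane check; bit i corresponds to byte lane i.
 * Shifting `odds` right by 1 aligns odd-lane bits onto even positions,
 * and 0x5555 keeps only the lanes each check is responsible for. */
static int both_checks_pass(unsigned evens, unsigned odds) {
    return (evens & (odds >> 1) & 0x5555) == 0x5555;
}
```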
Or `_mm_srli_epi16(odds, 8)` / `_mm_and_si128(evens, shifted_odds)` to get a vector where the even elements hold the results you care about. (The odd elements are zero because the logical shift shifted in zeros there, so `_mm_movemask_epi8(and_result) == 0x5555` without needing to mask away the elements we don't care about.)
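A scalar model of one 16-bit element of this shift/AND combine (the names `combine_pair` and `pair_movemask_bits` are invented; don't-care bytes are set to 0xFF as the worst case):

```c
#include <stdint.h>

/* One 16-bit element: low byte = an even lane, high byte = the adjacent
 * odd lane.  Each input holds a valid compare result (0xFF pass / 0x00
 * fail) only in the lanes its check owns; the other byte is junk (0xFF). */
static uint16_t combine_pair(int even_pass, int odd_pass) {
    uint16_t evens = (uint16_t)((even_pass ? 0x00FF : 0x0000) | 0xFF00);
    uint16_t odds  = (uint16_t)((odd_pass  ? 0xFF00 : 0x0000) | 0x00FF);
    uint16_t shifted = (uint16_t)(odds >> 8);  /* _mm_srli_epi16(odds, 8) */
    return (uint16_t)(evens & shifted);        /* _mm_and_si128           */
}

/* The two movemask bits this pair contributes: byte MSBs. */
static unsigned pair_movemask_bits(uint16_t pair) {
    return ((pair >> 7) & 1u) | ((pair >> 14) & 2u);
}
```

Per pair the expected movemask bits are `0b01` when both lanes pass and `0b00` otherwise, matching the `== 0x5555` check across all eight pairs.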