Given a `__m128i` which stores 16 `char`s, "even-index lanes" are the lanes at even indices (i.e., 0, 2, 4, ..., 14), and "odd-index lanes" are the lanes at odd indices (i.e., 1, 3, 5, ..., 15).
In my application, each even/odd lane must be in a given range. For example, suppose `even_min` is 1, `even_max` is 7, `odd_min` is 5, and `odd_max` is 10:
```
# valid
vec1: [1, 5, 6, 10, 2, 6, 4, 6, 2, 7, 4, 9, 2, 7, 4, 8]
# invalid because the 0-th lane (even) is greater than even_max
vec2: [8, 5, 6, 10, 2, 6, 4, 6, 2, 7, 4, 9, 2, 7, 4, 8]
```
How can I check whether a vector is valid more efficiently?

My current solution is straightforward: check the two comparison results separately.
```c
__m128i even_min = _mm_set1_epi8(xxx);
__m128i even_max = _mm_set1_epi8(xxx);
__m128i even_mask =
    _mm_set_epi8(0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1);
__m128i evenRange = _mm_and_si128(_mm_cmpge_epi8(vec, even_min),
                                  _mm_cmple_epi8(vec, even_max));
bool isEvenOk = _mm_testc_si128(evenRange, even_mask);
// the code for checking odd bytes is similar
```
Note that to compare unsigned chars with inclusive conditions, two macros are defined as follows:

```c
#define _mm_cmpge_epi8(a, b) _mm_cmpeq_epi8(_mm_max_epu8(a, b), a)
#define _mm_cmple_epi8(a, b) _mm_cmpge_epi8(b, a)
```
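As a sanity check on that macro trick, here is a hedged scalar model (plain C, no intrinsics; the names `cmpge_u8_via_max` and `verify_cmpge_model` are invented for this sketch) that brute-forces all byte pairs to confirm that per-lane `max_epu8(a, b) == a` is exactly unsigned `a >= b`:

```c
#include <stdint.h>

/* Scalar model of one lane of _mm_cmpge_epi8(a, b) as defined above:
 * pmaxub then pcmpeqb.  max(a, b) == a holds iff a >= b (unsigned). */
static int cmpge_u8_via_max(uint8_t a, uint8_t b) {
    uint8_t m = (a > b) ? a : b;   /* unsigned byte max, like pmaxub */
    return m == a;                 /* the SIMD lane would be 0xFF or 0x00 */
}

/* Brute-force all 65536 byte pairs; returns 1 if the identity holds. */
static int verify_cmpge_model(void) {
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (cmpge_u8_via_max((uint8_t)a, (uint8_t)b) != (a >= b))
                return 0;
    return 1;
}
```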
Make one vector of per-lane `min` values, and another of per-lane `max`es, e.g. `_mm_set1_epi16((odd_min<<8) | (uint8_t)even_min)`. (Note the cast to avoid sign-extension.)

Then you only need one range-check, which you should do more efficiently than by emulating `cmpge` and `cmple` with 2 instructions each. A simple way, as suggested by Andrey in comments, is `v == min(max(v, a), b)`, which is the same idea as your `v == max(v, a)`.
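A scalar sketch of that clamp idea (the name `in_range_clamp` is hypothetical; the SIMD version would be one `pmaxub`, one `pminub`, and one `pcmpeqb` per vector):

```c
#include <stdint.h>

/* Scalar model of one lane of v == min(max(v, lo), hi).  The clamp
 * leaves v unchanged exactly when lo <= v <= hi (unsigned, inclusive). */
static int in_range_clamp(uint8_t v, uint8_t lo, uint8_t hi) {
    uint8_t clamped = (v < lo) ? lo : v;           /* pmaxub(v, lo)   */
    clamped = (clamped > hi) ? hi : clamped;       /* pminub(.., hi)  */
    return clamped == v;                           /* pcmpeqb         */
}
```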
Since you're using the same mins/maxes for many inputs, some extra setup work is worth it to make each range-check cheaper. The usual range-check trick of `c - min < max - min` uses an unsigned compare, but we can do it with SSE signed compares by flipping the MSB of both sides, i.e. adding or subtracting 0x80. This is like range-shifting unsigned to signed, and the 0x80 can be folded into the same subtraction: `c - min - 0x80 < max - min - 0x80` (signed compare). (Thanks, @amonakov, for the reminder that this was possible.)
```c
// unsigned compare-trick, range-shifted for use with pcmpgtb

// loop-invariant constants, set these up once
__m128i mins = _mm_set1_epi16( ((odd_min<<8) | (uint8_t)even_min) ^ 0x8080 );
// if (odd_max == 0x7F && even_max == 0x7F) { ... }  // TODO: just check vec > mins
__m128i maxes = _mm_set1_epi16( (odd_max<<8) | (uint8_t)even_max );
__m128i rangelen = _mm_sub_epi8(maxes, mins);  // includes the 0x80 top bit from mins
// compilers will constant-propagate through this, except maybe MSVC.
// If that's a problem, write it a different way.

// work inside the loop
__m128i vsub = _mm_sub_epi8(vec, mins);
__m128i vout_of_range = _mm_cmpgt_epi8(vsub, rangelen);
// TODO: check for off-by-one errors in case I got this wrong, or inclusive vs. exclusive.
// consider mins = 0^0x80, maxes = 1, rangelen = 1^0x80 = -127.
// vec = 2: vsub = 2^0x80 = -126.  -126 > -127 so it's out-of-range
// (by 1; this range is inclusive at the top: vec = 1 gives vsub = -127, not greater).
bool isOk = !_mm_movemask_epi8(vout_of_range);  // ok if no bits set
```
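To settle that TODO, here is a hedged scalar model of one lane (the names `in_range_shifted` and `verify_range_trick` are invented for this sketch), brute-forced over all inputs; it agrees with an inclusive `min <= v <= max`:

```c
#include <stdint.h>

/* Scalar model of one lane of the compare-trick above.
 * min/max are the unsigned bounds for this lane, with min <= max. */
static int in_range_shifted(uint8_t v, uint8_t min, uint8_t max) {
    uint8_t mins     = min ^ 0x80;             /* range-shift, like ^0x8080  */
    uint8_t rangelen = (uint8_t)(max - mins);  /* includes mins' 0x80 bit    */
    uint8_t vsub     = (uint8_t)(v - mins);    /* psubb                      */
    return !((int8_t)vsub > (int8_t)rangelen); /* pcmpgtb, then movemask==0  */
}

/* Brute force: the trick must agree with min <= v && v <= max (inclusive). */
static int verify_range_trick(uint8_t min, uint8_t max) {
    for (int v = 0; v < 256; v++)
        if (in_range_shifted((uint8_t)v, min, max) != (v >= min && v <= max))
            return 0;
    return 1;
}
```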
@chtz suggests one `paddb` + `paddusb` + `pmovmskb` is possible if the size of your range is less than 128. In-range values won't have the MSB set in each byte, but out-of-range values will end up at 128 or above (and can't wrap around because of saturation). `pmovmskb` grabs the MSB of each byte, so it works without needing a compare result. `psubb` / `pcmpgtb` should be equally good on most CPUs. (Checking for `!= 0` is as cheap as `== 0` for the bitmap result.)
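A scalar sketch of that saturating-add idea (the names `msb_after_sat_add` and `verify_sat_trick` are invented; it assumes `max - min < 128`): subtract `min` with wrapping, then saturating-add `127 - (max - min)` so exactly the out-of-range values reach the MSB:

```c
#include <stdint.h>

/* Scalar model of one lane of the paddb + paddusb idea.
 * Requires max - min < 128 so in-range sums stay below the MSB. */
static int msb_after_sat_add(uint8_t v, uint8_t min, uint8_t max) {
    uint8_t len  = (uint8_t)(max - min);           /* range length, < 128   */
    uint8_t sub  = (uint8_t)(v - min);             /* paddb with set1(-min) */
    unsigned sum = sub + (unsigned)(127 - len);
    uint8_t sat  = (sum > 255) ? 255 : (uint8_t)sum; /* paddusb saturates   */
    return (sat & 0x80) != 0;                      /* the bit pmovmskb reads */
}

/* Brute force: MSB set exactly for values outside [min, max]. */
static int verify_sat_trick(uint8_t min, uint8_t max) {
    for (int v = 0; v < 256; v++)
        if (msb_after_sat_add((uint8_t)v, min, max) != !(v >= min && v <= max))
            return 0;
    return 1;
}
```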
**`min`/`max`/`cmpeq`**
Other possibilities include `(v < mins) | (v > maxes)` and checking that no elements are true: `_mm_movemask_epi8(or_result) == 0`. This has better critical-path latency than min/max/cmpeq, since we have two independent compares instead of a chain of 3 operations. Both ways need a copy of the original vector (unless you compile with AVX to allow a separate destination).
Or `(v > min-1) & (v < max+1)`, which is viable for compile-time-constant min/max. If min is already INT8_MIN, the first condition is always true, so it optimizes away to just the other condition. Except that's a problem when even_min is -128 but odd_min is something else: there's no constant that will make `pcmpgtb` always true for the even lanes of every input while still checking the odd lanes. I had been thinking the AND could be done as part of a `ptest` (`_mm_test_*`), but actually there's no `_mm_test_all_ones` for this: ZF is cleared if there's any non-zero bit in the 128-bit AND result (and likewise CF, based on the ANDN result).
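For reference, a scalar model of those `ptest` flag semantics (folded from 128 to 64 bits for brevity; the `model_*` names are invented):

```c
#include <stdint.h>

/* Scalar model of the PTEST flags:
 * _mm_testz_si128(a, b) returns ZF = ((a & b)  == 0)   (AND result)
 * _mm_testc_si128(a, b) returns CF = ((~a & b) == 0)   (ANDN result) */
static int model_testz(uint64_t a, uint64_t b) { return (a & b) == 0; }
static int model_testc(uint64_t a, uint64_t b) { return (~a & b) == 0; }
```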
Or use `cmpgt` both times and invert one of the results as part of combining them, e.g. with `_mm_andnot_si128` (`pandn`).
`ptest` is not very efficient on compare results since it decodes to 2 uops on most CPUs; `pmovmskb` + scalar `cmp` or `test` is also 2 uops (https://uops.info), and the `cmp` or `test` can macro-fuse with a branch if you're branching on it. `ptest` does avoid needing a temporary register and might save a `movdqa` register-copy if you're testing a vector that you also want to use later (not a comparison result), but it's usually good only if you're actually using its ability to check only some elements (like with your odd/even masks).
In your case, even with your strategy of two separate compares, a probably-better approach would be 2x `_mm_movemask_epi8` and `(evens & (odds>>1) & 0x5555) == 0x5555`. (0x5555 is 0b0101...0101, testing just the even bit positions.)
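A scalar sketch of that mask-combining step (the name `both_checks_pass` is hypothetical), taking the two 16-bit `pmovmskb` results:

```c
#include <stdint.h>

/* Combine two pmovmskb bitmaps: `evens` from the even-lane range check,
 * `odds` from the odd-lane check; bit i corresponds to byte lane i.
 * Shifting `odds` right by 1 aligns odd-lane bits onto even positions,
 * and 0x5555 keeps only the lanes each check is responsible for. */
static int both_checks_pass(unsigned evens, unsigned odds) {
    return (evens & (odds >> 1) & 0x5555) == 0x5555;
}
```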
Or `_mm_srli_epi16(odds, 8)` / `_mm_and_si128(evens, shifted_odds)` to get a vector where the even elements hold the results you care about. (The odd elements are zero because the logical shift shifted in zeros there, so `_mm_movemask_epi8(and_result) == 0x5555` without needing to mask away the elements we don't care about.)
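A scalar model of one 16-bit element of this shift/AND combine (the names `combine_pair` and `pair_movemask_bits` are invented; don't-care bytes are set to 0xFF as the worst case):

```c
#include <stdint.h>

/* One 16-bit element: low byte = an even lane, high byte = the adjacent
 * odd lane.  Each input holds a valid compare result (0xFF pass / 0x00
 * fail) only in the lanes its check owns; the other byte is junk (0xFF). */
static uint16_t combine_pair(int even_pass, int odd_pass) {
    uint16_t evens = (uint16_t)((even_pass ? 0x00FF : 0x0000) | 0xFF00);
    uint16_t odds  = (uint16_t)((odd_pass  ? 0xFF00 : 0x0000) | 0x00FF);
    uint16_t shifted = (uint16_t)(odds >> 8);  /* _mm_srli_epi16(odds, 8) */
    return (uint16_t)(evens & shifted);        /* _mm_and_si128           */
}

/* The two movemask bits this pair contributes: byte MSBs. */
static unsigned pair_movemask_bits(uint16_t pair) {
    return ((pair >> 7) & 1u) | ((pair >> 14) & 2u);
}
```

Per pair the expected movemask bits are `0b01` when both lanes pass and `0b00` otherwise, matching the `== 0x5555` check across all eight pairs.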