Tags: c, x86-64, micro-optimization, swar

Fastest way to find a 16-bit match in a 4-element short array?



I have an array, short arr[]={0x1234, 0x5432, 0x9090, 0xFEED};. I know I can use SIMD to compare all elements at once, using movemask+tzcnt to find the index of a match. However, since it's only 64 bits I was wondering if there's a faster way?

First I thought maybe I could build a 64-bit int by writing target|(target<<16)|(target<<32)|(target<<48), but then realized that neither an AND nor a SUB is the same as a compare, since the low 16 bits can affect the higher 16. Then I thought that instead of a plain loop I could write index = tzcnt((target==arr[0]?1:0) ... | (target==arr[3]?8:0)).
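
A rough sketch of that ternary idea (untested; __builtin_ctz stands in here for tzcnt and is undefined when nothing matches):

    #include <stdint.h>
    
    // index of the first matching element; undefined result if there is no match
    static inline int matchpos_ternary(uint16_t target, const uint16_t arr[4])
    {
        unsigned mask = (arr[0] == target ? 1u : 0u)
                      | (arr[1] == target ? 2u : 0u)
                      | (arr[2] == target ? 4u : 0u)
                      | (arr[3] == target ? 8u : 0u);
        return __builtin_ctz(mask);  // _tzcnt_u32(mask) with BMI1 would return 32 for no match
    }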

Can anyone think of something more clever? I suspect the ternary method would give me the best results since it's branchless, but today I don't feel clever and can't think of an easy way. I may confirm by benchmarking with nanobench.


Solution

  • For SWAR compare-for-equality, the operation you want is XOR, which like SUB produces all-zero on equal inputs, but unlike SUB doesn't propagate carry sideways.

    But then you need to detect a run of 16 contiguous zero bits. Unlike pcmpeqw's all-ones / all-zeros output, the non-matching elements will still have some zero bits scattered around.

    So it's probably about the same as https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord but with wider mask patterns to operate on 16-bit instead of 8-bit chunks.
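
    For example, the 16-bit-chunk version of that zero-in-word test would look like this (a sketch; it's the same expression the hasmatch_16in64 function below uses):

    #define haszero16(v) (((v) - 0x0001000100010001ULL) & ~(v) & 0x8000800080008000ULL)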

    There is yet a faster method: use hasless(v, 1), which is defined further down the bithacks page; it works in 4 operations and requires no subsequent verification. It simplifies to

    #define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
    

    The sub-expression (v - 0x01010101UL) evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second.

    This bithack was originally by Alan Mycroft in 1987.
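
    As a quick sanity check of that explanation (my own toy example, not from the bithacks page), the macro flags exactly the high bit of the zero byte:

    #include <assert.h>
    
    #define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
    
    int main(void)
    {
        // byte 2 of 0x12005678 is zero, so only bit 23 (that byte's high bit) survives
        assert(haszero(0x12005678UL) == 0x00800000UL);
        // no zero byte anywhere -> result is 0
        assert(haszero(0x12345678UL) == 0);
        return 0;
    }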

    So it could look like this (untested):

    #include <stdint.h>
    #include <string.h>
    
    // returns 0 / non-zero status.
    uint64_t hasmatch_16in64(uint16_t needle, const uint16_t haystack[4])
    {
        uint64_t vneedle = 0x0001000100010001ULL * needle;  // broadcast
        uint64_t vbuf;
        memcpy(&vbuf, haystack, sizeof(vbuf));  // aliasing-safe unaligned load
        //static_assert(sizeof(vbuf) == 4*sizeof(haystack[0]));
    
        uint64_t match = vbuf ^ vneedle;
        uint64_t any_zeros = (match - 0x0001000100010001ULL) & ~match & 0x8000800080008000ULL;
        return any_zeros;
        // unsigned matchpos = _tzcnt_u64(any_zeros) >> 4;  // see "Match position" below
    }
    

    Godbolt with GCC and clang, also including a SIMD intrinsics version.

    # gcc12.2 -O3 -march=x86-64-v3 -mtune=znver1
    # x86-64-v3 is the Haswell/Zen1 baseline: AVX2+FMA+BMI2, but with tune=generic
    # without tune=haswell or whatever, GCC uses shl/add /shl/add instead of imul, despite still needing the same constant
    
    hasmatch_16in64:
            movabs  rax, 281479271743489       #    0x1000100010001
            movzx   edi, di                    # zero-extend to 64-bit
            imul    rdi, rax                   # vneedle
            xor     rdi, QWORD PTR [rsi]       # match
       # then the bithack
            mov     rdx, rdi
            sub     rdx, rax
            andn    rax, rdi, rdx              # BMI1
            movabs  rdx, -9223231297218904064  # 0x8000800080008000
            and     rax, rdx
            ret
    

    Clang unfortunately adds 0xFFFEFFFEFFFEFFFF instead of reusing the multiplier constant, so it has three 64-bit immediate constants.

    AArch64 can do repeating-pattern constants like this as immediates for bitwise ops, and doesn't have as convenient a SIMD movemask, so this might be more of a win there, especially if you can guarantee alignment of your array of shorts.


    Match position

    If you need to know where the match is, I think that bithack has a 1 in the high bit of each zero byte or u16, and nowhere else. (The lowest-precedence / last operation is a bitwise AND involving 0x80008000...).

    So maybe tzcnt(any_zeros) >> 4 to go from bit-index to u16-index, rounding down. For example, if the second element is the zero one, the tzcnt result will be 31, and 31 >> 4 = 1.
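
    Putting that together (untested; assumes BMI1 for _tzcnt_u64, as in the -march=x86-64-v3 build above, and the function name is my own):

    #include <stdint.h>
    #include <string.h>
    #include <immintrin.h>
    
    // Index of the first matching element, or 4 if there is no match
    // (tzcnt of an all-zero mask is 64, and 64 >> 4 = 4).
    unsigned matchpos_16in64(uint16_t needle, const uint16_t haystack[4])
    {
        uint64_t vneedle = 0x0001000100010001ULL * needle;  // broadcast
        uint64_t vbuf;
        memcpy(&vbuf, haystack, sizeof(vbuf));  // aliasing-safe unaligned load
    
        uint64_t match = vbuf ^ vneedle;
        uint64_t any_zeros = (match - 0x0001000100010001ULL) & ~match & 0x8000800080008000ULL;
        return (unsigned)(_tzcnt_u64(any_zeros) >> 4);
    }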


    If that doesn't work, then yeah AVX2 or AVX-512 vpbroadcastw xmm0, edi / vmovq / vpcmpeqw / vpmovmskb / tzcnt will work well, too, with smaller code-size and fewer uops, but maybe higher latency, or maybe lower. (tzcnt on the pmovmskb result gives a byte offset; right-shift it by 1 if you need the index of which short.)

    Actually just SSE2 pshuflw can broadcast a word to the low qword of an XMM register. Same for MMX, which would actually allow a memory-source pcmpeqw mm0, [rsi] since it has no alignment requirement and is only 64-bit, not 128.

    If you can use SIMD intrinsics, especially if you have efficient word broadcast from AVX2, definitely have a look at it.

    #include <immintrin.h>
    
    // note the unsigned function arg, not uint16_t;
    // we only use the low 16, but GCC doesn't realize that and wastes an instruction in the non-AVX2 version
    int hasmatch_SIMD(unsigned needle, const uint16_t haystack[4])
    {
    #ifdef __AVX2__   // or higher
        __m128i vneedle = _mm_set1_epi16(needle);
    #else
        __m128i vneedle =  _mm_cvtsi32_si128(needle);  // movd
        vneedle = _mm_shufflelo_epi16(vneedle, 0);     // broadcast to low half
    #endif
    
        __m128i vbuf = _mm_loadl_epi64((void*)haystack);    // alignment and aliasing safe
        unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi16(vneedle, vbuf));
        //return _tzcnt_u32(mask) >> 1;
        return mask;
    }
    
    # clang expects narrow integer args to already be zero- or sign-extended to 32
    hasmatch_SIMD:
            movd    xmm0, edi
            pshuflw xmm0, xmm0, 0                   # xmm0 = xmm0[0,0,0,0,4,5,6,7]
            movq    xmm1, qword ptr [rsi]           # xmm1 = mem[0],zero
            pcmpeqw xmm1, xmm0
            pmovmskb        eax, xmm1
            ret
    

    AVX-512 gives us vpbroadcastw xmm0, edi, replacing vmovd + vpbroadcastw xmm,xmm or movd + pshuflw, saving a shuffle uop.

    With AVX2, this is 5 single-uop instructions, vs. 7 (or 9 counting the constants) for the SWAR bithack. Or 6 or 8 not counting the zero-extension of the "needle". So SIMD is better for front-end throughput. (https://agner.org/optimize/ / https://uops.info/)

    There are limits to which ports some of these instructions can run on (vs. the bithack instructions mostly being able to run on any integer ALU port), but presumably you're not doing this in a loop over many such 4-element arrays. Otherwise SIMD is an obvious win: you could check two 4-element arrays at once in the low and high halves of a __m128i (see the sketch below). So we probably do need to consider the front-end cost of setting up those constants.
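
    A sketch of that two-arrays-at-once idea (untested; the function name and the unpack approach are my own choices):

    #include <immintrin.h>
    #include <stdint.h>
    
    // Check two 4-element arrays with one pcmpeqw.
    // Bits 0..7 of the returned byte-mask come from a[], bits 8..15 from b[].
    unsigned hasmatch_2x4(unsigned needle, const uint16_t a[4], const uint16_t b[4])
    {
        __m128i vneedle = _mm_set1_epi16((short)needle);   // broadcast to all 8 words
        __m128i va = _mm_loadl_epi64((const __m128i*)a);   // a in the low 64 bits
        __m128i vb = _mm_loadl_epi64((const __m128i*)b);
        __m128i vbuf = _mm_unpacklo_epi64(va, vb);         // b in the high 64 bits
        return (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi16(vneedle, vbuf));
    }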

    I didn't add up the latencies; the SIMD version is probably a bit higher even on Intel CPUs, which generally have good latency between integer and SIMD units.

    GCC unfortunately fails to optimize away the movzx edi, di from the SIMD version if compiled without AVX2; only clang realizes the upper 16 of _mm_cvtsi32_si128(needle) is discarded by the later shuffle. Maybe better to make the function arg unsigned, not explicitly a narrow 16-bit type.
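
    For completeness, a quick smoke test against the array from the question might look like this (untested; assumes the SWAR and SIMD functions above are in the same file and BMI1 is enabled for the tzcnt intrinsics):

    #include <stdio.h>
    #include <stdint.h>
    #include <immintrin.h>
    
    int main(void)
    {
        uint16_t arr[4] = {0x1234, 0x5432, 0x9090, 0xFEED};
    
        uint64_t swar = hasmatch_16in64(0x9090, arr);
        unsigned simd = (unsigned)hasmatch_SIMD(0x9090, arr);
    
        printf("SWAR index: %u\n", (unsigned)(_tzcnt_u64(swar) >> 4));  // expect 2
        printf("SIMD index: %u\n", _tzcnt_u32(simd) >> 1);              // byte mask -> word index, expect 2
        return 0;
    }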