Tags: c, x86-64, micro-optimization, swar

Fastest way to find a 16-bit match in a 4-element short array?



I have an array, short arr[]={0x1234, 0x5432, 0x9090, 0xFEED};. I know I can use SIMD to compare all elements at once, using movemask+tzcnt to find the index of a match. However, since it's only 64 bits I was wondering if there's a faster way?

First I thought maybe I could build a 64-bit int by writing target|(target<<16)|(target<<32)|(target<<48), but then realized that neither an AND nor a SUB is the same as a compare, since the low 16 bits can affect the higher 16. Then I thought that instead of a plain loop I could write index = tzcnt((target==arr[0]?1:0) ... | (target==arr[3]?8:0)).
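
A rough sketch of that ternary idea (untested; __builtin_ctz stands in here for tzcnt and is undefined when nothing matches):

    #include <stdint.h>
    
    // index of the first matching element; undefined result if there is no match
    static inline int matchpos_ternary(uint16_t target, const uint16_t arr[4])
    {
        unsigned mask = (arr[0] == target ? 1u : 0u)
                      | (arr[1] == target ? 2u : 0u)
                      | (arr[2] == target ? 4u : 0u)
                      | (arr[3] == target ? 8u : 0u);
        return __builtin_ctz(mask);  // _tzcnt_u32(mask) with BMI1 would return 32 for no match
    }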

Can anyone think of something more clever? I suspect the ternary method would give me the best results since it's branchless, but today I don't feel clever and can't think of an easy way. I may confirm by benchmarking with nanobench.


Solution

  • For SWAR compare-for-equality, the operation you want is XOR, which like SUB produces all-zero on equal inputs, but unlike SUB doesn't propagate carry sideways.

    But then you need to detect a run of 16 contiguous zero bits. Unlike pcmpeqw's all-ones / all-zeros output, the non-matching elements will still have some zero bits scattered around.

    So it's probably about the same as https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord but with wider mask patterns to operate on 16-bit instead of 8-bit chunks.
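
    For example, the 16-bit-chunk version of that zero-in-word test would look like this (a sketch; it's the same expression the hasmatch_16in64 function below uses):

    #define haszero16(v) (((v) - 0x0001000100010001ULL) & ~(v) & 0x8000800080008000ULL)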

    There is yet a faster method: use hasless(v, 1), which is defined further down the bithacks page; it works in 4 operations and requires no subsequent verification. It simplifies to

    #define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
    

    The sub-expression (v - 0x01010101UL) evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second.

    This bithack was originally by Alan Mycroft in 1987.
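
    As a quick sanity check of that explanation (my own toy example, not from the bithacks page), the macro flags exactly the high bit of the zero byte:

    #include <assert.h>
    
    #define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
    
    int main(void)
    {
        // byte 2 of 0x12005678 is zero, so only bit 23 (that byte's high bit) survives
        assert(haszero(0x12005678UL) == 0x00800000UL);
        // no zero byte anywhere -> result is 0
        assert(haszero(0x12345678UL) == 0);
        return 0;
    }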

    So it could look like this (untested):

    #include <stdint.h>
    #include <string.h>
    
    // returns 0 / non-zero status.
    uint64_t hasmatch_16in64(uint16_t needle, const uint16_t haystack[4])
    {
        uint64_t vneedle = 0x0001000100010001ULL * needle;  // broadcast
        uint64_t vbuf;
        memcpy(&vbuf, haystack, sizeof(vbuf));  // aliasing-safe unaligned load
        //static_assert(sizeof(vbuf) == 4*sizeof(haystack[0]));
    
        uint64_t match = vbuf ^ vneedle;
        uint64_t any_zeros = (match - 0x0001000100010001ULL) & ~match & 0x8000800080008000ULL;
        return any_zeros;
        // unsigned matchpos = _tzcnt_u64(any_zeros) >> 4;  // see "Match position" below
    }
    

    Godbolt with GCC and clang, also including a SIMD intrinsics version.

    # gcc12.2 -O3 -march=x86-64-v3 -mtune=znver1
    # x86-64-v3 is the Haswell/Zen1 baseline: AVX2+FMA+BMI2, but with tune=generic
    # without tune=haswell or whatever, GCC uses shl/add /shl/add instead of imul, despite still needing the same constant
    
    hasmatch_16in64:
            movabs  rax, 281479271743489       #    0x1000100010001
            movzx   edi, di                    # zero-extend to 64-bit
            imul    rdi, rax                   # vneedle
            xor     rdi, QWORD PTR [rsi]       # match
       # then the bithack
            mov     rdx, rdi
            sub     rdx, rax
            andn    rax, rdi, rdx              # BMI1
            movabs  rdx, -9223231297218904064  # 0x8000800080008000
            and     rax, rdx
            ret
    

    Clang unfortunately adds 0xFFFEFFFEFFFEFFFF instead of reusing the multiplier constant, so it has three 64-bit immediate constants.

    AArch64 can do repeating-pattern constants like this as immediates for bitwise ops, and doesn't have as convenient a SIMD movemask, so this might be more of a win there, especially if you can guarantee alignment of your array of shorts.


    Match position

    If you need to know where the match is, I think that bithack has a 1 in the high bit of each zero byte or u16, and nowhere else. (The lowest-precedence / last operation is a bitwise AND involving 0x80008000...).

    So maybe tzcnt(any_zeros) >> 4 to go from bit-index to u16-index, rounding down. For example, if the second element is the zero one, the tzcnt result will be 31, and 31 >> 4 = 1.
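
    Putting that together (untested; assumes BMI1 for _tzcnt_u64, as in the -march=x86-64-v3 build above, and the function name is my own):

    #include <stdint.h>
    #include <string.h>
    #include <immintrin.h>
    
    // Index of the first matching element, or 4 if there is no match
    // (tzcnt of an all-zero mask is 64, and 64 >> 4 = 4).
    unsigned matchpos_16in64(uint16_t needle, const uint16_t haystack[4])
    {
        uint64_t vneedle = 0x0001000100010001ULL * needle;  // broadcast
        uint64_t vbuf;
        memcpy(&vbuf, haystack, sizeof(vbuf));  // aliasing-safe unaligned load
    
        uint64_t match = vbuf ^ vneedle;
        uint64_t any_zeros = (match - 0x0001000100010001ULL) & ~match & 0x8000800080008000ULL;
        return (unsigned)(_tzcnt_u64(any_zeros) >> 4);
    }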


    If that doesn't work, then yeah AVX2 or AVX-512 vpbroadcastw xmm0, edi / vmovq / vpcmpeqw / vpmovmskb / tzcnt will work well, too, with smaller code-size and fewer uops, but maybe higher latency, or maybe lower. (tzcnt on the pmovmskb result gives a byte offset; right-shift it by 1 if you need the index of which short.)

    Actually just SSE2 pshuflw can broadcast a word to the low qword of an XMM register. Same for MMX, which would actually allow a memory-source pcmpeqw mm0, [rsi] since it has no alignment requirement and is only 64-bit, not 128.

    If you can use SIMD intrinsics, especially if you have efficient word broadcast from AVX2, definitely have a look at it.

    #include <immintrin.h>
    
    // note the unsigned function arg, not uint16_t;
    // we only use the low 16, but GCC doesn't realize that and wastes an instruction in the non-AVX2 version
    int hasmatch_SIMD(unsigned needle, const uint16_t haystack[4])
    {
    #ifdef __AVX2__   // or higher
        __m128i vneedle = _mm_set1_epi16(needle);
    #else
        __m128i vneedle =  _mm_cvtsi32_si128(needle);  // movd
        vneedle = _mm_shufflelo_epi16(vneedle, 0);     // broadcast to low half
    #endif
    
        __m128i vbuf = _mm_loadl_epi64((void*)haystack);    // alignment and aliasing safe
        unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi16(vneedle, vbuf));
        //return _tzcnt_u32(mask) >> 1;
        return mask;
    }
    
    # clang expects narrow integer args to already be zero- or sign-extended to 32
    hasmatch_SIMD:
            movd    xmm0, edi
            pshuflw xmm0, xmm0, 0                   # xmm0 = xmm0[0,0,0,0,4,5,6,7]
            movq    xmm1, qword ptr [rsi]           # xmm1 = mem[0],zero
            pcmpeqw xmm1, xmm0
            pmovmskb        eax, xmm1
            ret
    

    AVX-512 gives us vpbroadcastw xmm0, edi, replacing vmovd + vpbroadcastw xmm,xmm or movd + pshuflw, saving a shuffle uop.

    With AVX2, this is 5 single-uop instructions, vs. 7 (or 9 counting the constants) for the SWAR bithack. Or 6 or 8 not counting the zero-extension of the "needle". So SIMD is better for front-end throughput. (https://agner.org/optimize/ / https://uops.info/)

    There are limits to which ports some of these instructions can run on (vs. the bithack instructions mostly being able to run on any integer ALU port), but presumably you're not doing this in a loop over many such 4-element arrays. Otherwise SIMD is an obvious win: you could check two 4-element arrays at once in the low and high halves of a __m128i (see the sketch below). So we probably do need to consider the front-end cost of setting up those constants.
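
    A sketch of that two-arrays-at-once idea (untested; the function name and the unpack approach are my own choices):

    #include <immintrin.h>
    #include <stdint.h>
    
    // Check two 4-element arrays with one pcmpeqw.
    // Bits 0..7 of the returned byte-mask come from a[], bits 8..15 from b[].
    unsigned hasmatch_2x4(unsigned needle, const uint16_t a[4], const uint16_t b[4])
    {
        __m128i vneedle = _mm_set1_epi16((short)needle);   // broadcast to all 8 words
        __m128i va = _mm_loadl_epi64((const __m128i*)a);   // a in the low 64 bits
        __m128i vb = _mm_loadl_epi64((const __m128i*)b);
        __m128i vbuf = _mm_unpacklo_epi64(va, vb);         // b in the high 64 bits
        return (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi16(vneedle, vbuf));
    }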

    I didn't add up the latencies; the SIMD version is probably a bit higher even on Intel CPUs, which generally have good latency between integer and SIMD units.

    GCC unfortunately fails to optimize away the movzx edi, di from the SIMD version if compiled without AVX2; only clang realizes the upper 16 of _mm_cvtsi32_si128(needle) is discarded by the later shuffle. Maybe better to make the function arg unsigned, not explicitly a narrow 16-bit type.
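
    For completeness, a quick smoke test against the array from the question might look like this (untested; assumes the SWAR and SIMD functions above are in the same file and BMI1 is enabled for the tzcnt intrinsics):

    #include <stdio.h>
    #include <stdint.h>
    #include <immintrin.h>
    
    int main(void)
    {
        uint16_t arr[4] = {0x1234, 0x5432, 0x9090, 0xFEED};
    
        uint64_t swar = hasmatch_16in64(0x9090, arr);
        unsigned simd = (unsigned)hasmatch_SIMD(0x9090, arr);
    
        printf("SWAR index: %u\n", (unsigned)(_tzcnt_u64(swar) >> 4));  // expect 2
        printf("SIMD index: %u\n", _tzcnt_u32(simd) >> 1);              // byte mask -> word index, expect 2
        return 0;
    }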