Tags: x86, sse, simd, avx, avx2

Find the first instance of a character using SIMD


I am trying to find the first instance of a character, in this case '"', using SIMD (AVX2 or earlier). I'd like to use _mm256_cmpeq_epi8, but then I need a quick way of finding whether any of the resulting bytes in the __m256i have been set to 0xFF. The plan was then to use _mm256_movemask_epi8 to convert the result from bytes to bits, and then to use ffs to get a matching index. Is it better to move out a portion at a time using _mm_movemask_epi8? Any other suggestions?


Solution

  • You have the right idea with _mm256_cmpeq_epi8 -> _mm256_movemask_epi8. That's the optimal way to implement this with AVX2. VPMOVMSKB r32, ymm is the same speed as the XMM 16-byte version, so it would be a huge loss to unpack the two lanes of a 256b vector and movemask them separately and then recombine the integer results. (Source: Agner Fog's instruction table, or more recently https://uops.info/ has better data. See other perf links in the tag wiki.)

    Make the code inside the loop as efficient as possible by leaving the ffs or _tzcnt_u32 until after you've identified a non-zero result from _mm256_movemask_epi8.
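    Something like this (untested sketch, not from any particular library; assumes AVX2 and BMI1 are enabled, e.g. -mavx2 -mbmi or -march=haswell, and for brevity that len is a multiple of 32 so there's no tail handling):

        #include <immintrin.h>
        #include <cstddef>
        #include <cstdint>

        // Index of the first byte equal to c, or len if not found.
        // Sketch only: assumes len % 32 == 0; a real version needs tail handling.
        size_t find_first(const char* buf, size_t len, char c) {
            const __m256i needle = _mm256_set1_epi8(c);
            for (size_t i = 0; i < len; i += 32) {
                __m256i v  = _mm256_loadu_si256((const __m256i*)(buf + i));
                __m256i eq = _mm256_cmpeq_epi8(v, needle);          // 0xFF where byte == c
                uint32_t mask = (uint32_t)_mm256_movemask_epi8(eq); // 1 bit per byte
                if (mask != 0)                      // cheap test/jcc loop exit
                    return i + _tzcnt_u32(mask);    // bit-scan only after a hit
                    // (C++20: i + std::countr_zero(mask) is equivalent)
            }
            return len;
        }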

    TEST/JCC can macro-fuse into a single uop, but BSF/JCC can't, so it takes an extra instruction. (And you'd be hard-pressed to get a C compiler to emit BSF/JCC anyway. More likely, branching on the result of ffs would give you some kind of test for the input being non-zero, then BSF, then add 1, then compare-and-branch. That's obviously horrible compared to just testing the movemask result.)

    (Update: in C++20, use std::countr_zero. It can compile to a single tzcnt, instead of carrying ffs's off-by-one result. Since you've already checked that the mask is non-zero, the compiler can hopefully optimize it to a single (rep) bsf instruction if it isn't sure all CPUs running the code will support tzcnt. If you can assume BMI1 in your target CPUs, which you usually can for AVX2 code, enable it (e.g. with -mbmi or -march=haswell) so you'll reliably get an efficient tzcnt.)

    For related problems like memcmp, you can test movemask == 0xFFFFFFFF to check that all bytes matched; that's just as efficient as branching on it being non-zero to find whether any elements matched. (Then std::countr_zero(~mask) to find the first mismatch. Or I guess std::countr_one, in case any ISA has an instruction for that... but we know we're compiling for x86, so it'll internally have to use NOT / TZCNT anyway.)
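    An untested sketch of that memcmp-style check for one 32-byte block (first_mismatch32 is just an illustrative helper name; reuses the includes from the sketch above, plus <bit> for std::countr_zero):

        // Byte index of the first difference between two 32-byte blocks,
        // or 32 if they're identical. A real memcmp needs length handling.
        size_t first_mismatch32(const char* a, const char* b) {
            __m256i va = _mm256_loadu_si256((const __m256i*)a);
            __m256i vb = _mm256_loadu_si256((const __m256i*)b);
            uint32_t mask = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(va, vb));
            if (mask == 0xFFFFFFFFu)
                return 32;                               // all 32 bytes matched
            return std::countr_zero((uint32_t)~mask);    // first 0 bit = first mismatch
        }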


    As Paul R suggested, looking at some strlen, strchr, and memchr implementations may be informative. There are multiple hand-written asm implementations in open-source libc implementations, and other places. (e.g. glibc, and Agner Fog's asmlib.)

    Many of glibc's versions scan up to an alignment boundary, then use an unrolled loop that reads 64B at a time (in 4 SSE vectors; glibc also has AVX2 versions now, linked below). For str (not mem) functions, the alignment handling is necessary for correctness: Is it safe to read past the end of a buffer within the same page on x86 and x64? (technically no in C++, but it's hard for the compiler to notice. Safe in asm.)
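    Here's an untested sketch of that first-vector trick (illustrative helper name; same includes as above, plus <bit>). Per the linked Q&A it's UB by the letter of C++, but the aligned load can't cross into another page, so it can't fault:

        // Round s down to a 32-byte boundary, do one aligned load (which
        // can't cross a page boundary, so it can't fault), then shift away
        // the match bits for bytes before the real start of the string.
        const char* scan_first_vector(const char* s, char c) {
            uintptr_t misalign = (uintptr_t)s & 31;
            const __m256i needle = _mm256_set1_epi8(c);
            __m256i v = _mm256_load_si256((const __m256i*)(s - misalign));
            uint32_t mask = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, needle));
            mask >>= misalign;                 // drop bits for bytes before s
            if (mask)
                return s + std::countr_zero(mask);
            return nullptr;  // no hit in the first vector; continue with an aligned loop
            // (A real strchr must also check this vector for the '\0' terminator.)
        }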

    To optimize for long strings, reduce the overhead of testing the compare results by ORing several compare results together and checking the combined vector. If you find a hit, go back and re-test your vectors to see which vector had the hit.

    It may be somewhat more efficient to do the ffs on one 64-bit integer that you built up out of multiple movemask results (with shift and |). Don't do that inside the loop, though; just test mask1 | mask2 for non-zero (or/jz keep_looping). It's useful for sorting out which of two vectors has the match if you were already unrolling, though. Have a look at what Agner Fog's string functions do, or if you don't mind reading GPLed code, have a look at glibc.
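    An untested sketch of that unrolled shape, building the 64-bit mask only after the cheap combined test says there's a hit (again assuming AVX2+BMI1, a 64-bit build for _tzcnt_u64, and len a multiple of 64):

        // 64 bytes per iteration, one branch per iteration. Only on a hit,
        // combine the two movemask results into a 64-bit mask so a single
        // tzcnt finds the index. Sketch only: assumes len % 64 == 0.
        size_t find_first_64(const char* buf, size_t len, char c) {
            const __m256i needle = _mm256_set1_epi8(c);
            for (size_t i = 0; i < len; i += 64) {
                __m256i eq0 = _mm256_cmpeq_epi8(
                    _mm256_loadu_si256((const __m256i*)(buf + i)), needle);
                __m256i eq1 = _mm256_cmpeq_epi8(
                    _mm256_loadu_si256((const __m256i*)(buf + i + 32)), needle);
                // OR the compare results so one test covers both vectors.
                if (_mm256_movemask_epi8(_mm256_or_si256(eq0, eq1)) == 0)
                    continue;                            // no hit, keep looping
                uint64_t mask = (uint32_t)_mm256_movemask_epi8(eq0)
                              | ((uint64_t)(uint32_t)_mm256_movemask_epi8(eq1) << 32);
                return i + _tzcnt_u64(mask);
            }
            return len;
        }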


    Everything I've suggested here is stuff that can be seen in asm in various glibc strategies for strlen, memchr, and related functions. Here's sysdeps/x86_64/strlen.S (LGPL license), but there may be another source file somewhere using more than baseline SSE2. Or not; I might be thinking of a different function. There's probably nothing to be gained from SSE4, only maybe AVX (3-operand insns) and definitely AVX2 (256b integer vectors).

    See also:

    • glibc's strchr-avx2.S (codebrowser.dev has a nice source browser with a useful search for filenames / symbols).
    • glibc's memchr-avx2.S

    glibc's memchr uses PMAXUB instead of POR. I'm not sure if that's useful for some arcane microarchitectural reason, but it runs on fewer ports on most CPUs. Perhaps that's desired, to avoid resource conflicts with something else? IDK, seems weird, since it competes with PCMPEQB.

    Perhaps the author was thinking of min/max operations because pminub is useful in glibc's strlen algorithm, where pminub before the compare gives a zero byte iff either input has a zero.
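    An untested sketch of that pminub trick (illustrative helper name, same includes as the first sketch): one vpminub + vpcmpeqb + vpmovmskb tests 64 bytes for a terminator with a single branch:

        // min(v0, v1) has a 0x00 byte iff v0 or v1 does, so one compare
        // against zero tests 64 bytes at once.
        bool has_zero_byte_64(const char* p) {
            __m256i v0 = _mm256_loadu_si256((const __m256i*)p);
            __m256i v1 = _mm256_loadu_si256((const __m256i*)(p + 32));
            __m256i z  = _mm256_cmpeq_epi8(_mm256_min_epu8(v0, v1),
                                           _mm256_setzero_si256());
            return _mm256_movemask_epi8(z) != 0;
            // On a hit, re-test v0 and v1 individually to locate the zero byte.
        }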