String length function is unstable

So I made this strlen a while ago and everything seemed fine. But I started noticing bugs with my codebase and after a while I tracked it down to this strlen function. I used SIMD instructions to write it and I am new to writing intrinsics so the code isn't probably the best it could be either.

Here is the function:

inline size_t strlen(const char* data) {
        const __m256i terminationCharacters = _mm256_setzero_si256();
        const size_t shiftAmount = ((size_t)&data) & 31;
        const __m256i* pointer = (const __m256i*) (data - shiftAmount);

        size_t length = 0;

        for (;; length += 32, ++pointer) {
            const __m256i comparingData = _mm256_load_si256(pointer);
            const __m256i comparison = _mm256_cmpeq_epi8(comparingData, terminationCharacters);

            if (!_mm256_testc_si256(terminationCharacters, comparison)) {
                const auto mask = _mm256_movemask_epi8(comparison);

                return length + _tzcnt_u32(mask >> shiftAmount);
            }
        }
    }

Solution

Your attempt to combine startup handling into the aligned-vector loop has at least 2 showstopper bugs:

You exit the loop if your aligned load finds any zero bytes, even if they're from before the proper start of the string. (@James Griffin spotted this in comments). You need to do mask >>= shiftAmount and check that for non-zero to see if there were any matches in the part of the load that comes after the start of the string. (Don't use _mm256_testc_si256, just movemask and check).
_tzcnt_u32(mask >> shiftAmount); is buggy for any vectors after the first. The whole vector comes from bytes after the start of the string, so you need tzcnt to see all of bits. Instead, you want _tzcnt_u32(mask) - shiftAmount, I think.

Make yourself some test cases with 0 bytes before the actual string but inside the first aligned vector. And test cases with the final 0 in different places relative to a vector, and non-zero and test your version against libc strlen. (Maybe even for some randomized 0-positions within the first 32 bytes, and then within the first 64 bytes after that.)

Your strategy for handling unaligned startup should work, if you separate it from the loop. (Is it safe to read past the end of a buffer within the same page on x86 and x64?).

Another option is a page-cross check before a first unaligned vector load from the actual start of the string. (But then you need a fallback to something else). Then go aligned: overlap is fine; as long as you calculate the final length correctly, it doesn't matter if you check the same byte twice for being zero.

(You also don't really want the compiler to be wasting instructions inside the loop incrementing a pointer and a separate length, so check the resulting asm. A pointer-subtract after the loop should do the trick. Even cast to uintptr_t.
Also, you can subtract the final zero-position from the initial function arg, instead of from the aligned pointer, so instead of subtracting shiftAmount twice, you're just not using it at all except for the initial alignment.)

Don't use the vptest intrinsic (_mm256_testc_si256) at all, even in the main loop when you should be checking all the bytes; it's not better for _mm_cmp* results. vptest is 2 uops and can't macro-fuse with a branch instruction. But vpmovmskb eax, ymm0 is 1 uop, and test eax,eax / jz .loop is another one macro-fused uop. And even better, you actually need the integer movemask result after the loop, so you already have it.

Related

Is it safe to read past the end of a buffer within the same page on x86 and x64?
Why does glibc's strlen need to be so complicated to run quickly? (includes links to hand-written x86-64 asm for glibc's strlen implementation.) Unless you're on a platform with a worse C library, normally you should use that, because glibc uses CPU detection during dynamic linking to select a good version of strlen (and memcpy, etc.) for your CPU. Unaligned-startup for strlen is somewhat tricky, and glibc I think makes reasonable choices, unless the function-call overhead is a big problem. It also has good loop-unrolling techniques for big strings (like _mm256_min_epu8 to get a zero in a vector element if either of 2 input vectors had a zero, so it can amortize the actual movemask/branch work over a whole cache-line of data). It might be too aggressive in ramping up to that for medium-length strings though.

Note that glibc's licence is the LGPL, so you can't just copy code from glibc into your project unless your license is compatible. Even writing an intrinsics equivalent of its asm might be questionable.
Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled? - a simple SSE2 strlen that doesn't handle misalignment, in hand-written asm. And comments on benchmarking.
https://agner.org/optimize/ - guides and instruction tables, and his subroutine library (in hand-written asm) includes a strlen. (But note it's GPL licensed.)

I assume some of the BSDs and MacOS have an asm strlen under a more permissive license you could use / look at if your project isn't GPL-compatible.