c gcc bit-manipulation intrinsics avx512

strlen AVX-512 __builtin_ctz invalid value


I wrote a strlen function using AVX-512 instructions. This is my source code:

size_t avx512_strlen(const char * s) {
    __m512i vec0, vec1;
    unsigned long long mask;
    const char * ptr = s;

    vec0 = _mm512_setzero_epi32();

    while (1) {
        vec1 = _mm512_loadu_si512(s);
        mask = _mm512_cmpeq_epi8_mask(vec0, vec1);

        if(mask != 0) {
            mask = __builtin_ctz(mask);
            return (s-ptr) + mask;
        }

        s += 64;
    }

    return s-ptr;
}

The value returned by '__builtin_ctz(mask)' is wrong, so the function returns an incorrect length. Specifically, the function cannot compute the position of the null terminator (0x00) in the final check.

For example, I have this string:

char str[] = "EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE"
                 "EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE"
                 "EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE"
                 "EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE";

The length of this string is 360, but the function returns 352. The problem comes from the '__builtin_ctz' part. Before calling '__builtin_ctz', the mask is correct; it is:

0001110100010001000100010000000000000000000000000000000000000000

By the final check we have already scanned 320 characters, so __builtin_ctz should return 40. As you can see in the mask, there are 40 zeros before the first '1', so the mask is correct and '__builtin_ctz' counts it wrong!

What is the problem?


Solution

  • __builtin_ctz takes an unsigned int, which is 32 bits on typical x86 platforms, while unsigned long long is 64 bits there. So your mask is truncated to its low 32 bits when it is passed as the argument on this line:

                mask = __builtin_ctz(mask);
    

    Since the low 32 bits are all zero, the result is undefined (per the GCC documentation):

    "Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined."

    (Despite being undefined, 352 - 320 = 32 is a reasonable answer for "number of trailing 0 bits in a 32-bit zero integer.")

    You probably meant to use __builtin_ctzll(mask) instead. That should get you the correct count.
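
    As a rough illustration of the truncation, the sketch below works on the mask from the question, written in hex as 0x1D11110000000000. The helper names lowest_set_bit64 and low_half are hypothetical and not part of the original code. The sketch assumes GCC or Clang on a platform where unsigned int is 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): index of the lowest
   set bit in a 64-bit mask. The mask must be nonzero, because
   __builtin_ctzll(0) is undefined, just like __builtin_ctz(0). */
static unsigned lowest_set_bit64(unsigned long long mask)
{
    assert(mask != 0);
    return (unsigned)__builtin_ctzll(mask);
}

/* Hypothetical helper: the value __builtin_ctz effectively sees when it is
   handed a 64-bit mask and unsigned int is 32 bits wide. Only the low
   32 bits survive the conversion. */
static uint32_t low_half(unsigned long long mask)
{
    return (uint32_t)mask;
}
```

    For the mask in the question, low_half(0x1D11110000000000ULL) is 0, which is exactly the undefined input case, while lowest_set_bit64 gives 40, so the length comes out as 320 + 40 = 360. In the original loop, the equivalent change is to replace __builtin_ctz(mask) with __builtin_ctzll(mask) inside the existing "mask != 0" branch.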