I wrote a strlen function using AVX-512 instructions. Here is my source code:
size_t avx512_strlen(const char * s) {
    __m512i vec0, vec1;
    unsigned long long mask;
    const char * ptr = s;
    vec0 = _mm512_setzero_epi32();
    while (1) {
        vec1 = _mm512_loadu_si512(s);
        mask = _mm512_cmpeq_epi8_mask(vec0, vec1);
        if(mask != 0) {
            mask = __builtin_ctz(mask);
            return (s-ptr) + mask;
        }
        s += 64;
    }
    return s-ptr;
}
There is a problem with the value of '__builtin_ctz(mask)', and the returned value is not correct. The function cannot calculate the position of the null terminator (0x00) in the last check.
For example, I have this string:
char str[] = "EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE"
"EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE"
"EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE"
"EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE";
The length of this string is 360, but the function returns 352. The problem comes from the '__builtin_ctz' part. Before '__builtin_ctz' is called, the mask is correct. It is:
0001110100010001000100010000000000000000000000000000000000000000
By the last check we have already scanned 320 characters, so __builtin_ctz must return 40 (as you can see in the mask, there are 40 zeros before the first '1'). The mask is correct, but '__builtin_ctz' counts it wrong!
What is the problem?
__builtin_ctz operates on unsigned int, which is likely 32 bits on any x86 platform. Meanwhile, unsigned long long is likely 64 bits on any x86 platform. So your mask is truncated at this line:
mask = __builtin_ctz(mask);
Since the low 32 bits are all zero, the result is undefined (per GCC):
Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.
(Despite being undefined, 352 - 320 = 32 is a reasonable answer for "number of trailing 0 bits in a 32-bit zero integer.")
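To make the truncation concrete, here is a small sketch that applies it to the mask from the question (0x1D11110000000000 in hex). The helper names low32 and first_set_bit64 are made up for illustration, and the sketch assumes a 32-bit unsigned int and GCC or Clang builtins:

```c
#include <stdint.h>

/* Hypothetical helper: the value __builtin_ctz effectively sees when it is
   handed a 64-bit mask, assuming unsigned int is 32 bits wide. */
static uint32_t low32(uint64_t mask) {
    return (uint32_t)mask;  /* the upper 32 bits are discarded */
}

/* Hypothetical helper: index of the lowest set bit in a 64-bit mask.
   The caller must ensure mask != 0, because the builtin is undefined for 0. */
static unsigned first_set_bit64(uint64_t mask) {
    return (unsigned)__builtin_ctzll((unsigned long long)mask);
}
```

With the question's mask, low32 yields 0 (the undefined case for __builtin_ctz), while first_set_bit64 yields 40, and 320 + 40 gives the expected 360.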
You probably meant to use __builtin_ctzll(mask) instead. That should get you the correct count.
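For reference, a minimal sketch of the loop with the 64-bit count applied might look like the following. The name avx512_strlen_fixed is made up for illustration, the target attribute assumes GCC or Clang, and the sketch keeps the original's assumption that a full 64 bytes can be read from every block start, even past the terminator:

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch only: same structure as the question's function, but the bit scan
   uses the 64-bit builtin. Assumes AVX-512BW hardware and that reading
   64 bytes from each block start is safe (as the original also assumes). */
__attribute__((target("avx512f,avx512bw")))
size_t avx512_strlen_fixed(const char *s) {
    const char *start = s;
    const __m512i zero = _mm512_setzero_si512();
    for (;;) {
        __m512i chunk = _mm512_loadu_si512((const void *)s);
        /* One mask bit per byte; bit i is set when byte i equals 0. */
        __mmask64 mask = _mm512_cmpeq_epi8_mask(chunk, zero);
        if (mask != 0) {
            /* mask is nonzero here, so the count is well defined. */
            return (size_t)(s - start)
                 + (size_t)__builtin_ctzll((unsigned long long)mask);
        }
        s += 64;
    }
}
```

Keeping the bit index in a separate expression, rather than writing it back into mask, also avoids reusing one variable for two different meanings.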