I would like to extract the index of the highest set bit in a 256 bit AVX register with 8 bit elements. I could neither find a bsr
nor a clz
implementation for this.
For clz
with 32 bit elements, there is the bithack with a float conversion, but that is probably not possible for 8 bits.
Currently, I am working on a solution, where I check the bits one by one, which I will add later, but I wonder if there is a faster way to do this.
Here is a vpshufb
based solution. The idea is to split the input into two halves, make a lookup on both and combine the results:
__m256i clz_epu8(__m256i values)
{
// extract upper nibble:
__m256i hi = _mm256_and_si256(_mm256_srli_epi16(values, 4), _mm256_set1_epi8(0xf));
// this sets the highest bit for values >= 0x10 and otherwise keeps the lower nibble unmodified:
__m256i lo = _mm256_adds_epu8(values, _mm256_set1_epi8(0x70));
// lookup tables for count-leading-zeros (replace this by _mm256_setr_epi8, if this does not get optimized away)
// ideally, this should compile to vbroadcastf128 ...
const __m256i lookup_hi = _mm256_broadcastsi128_si256(_mm_setr_epi8(0, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0));
const __m256i lookup_lo = _mm256_broadcastsi128_si256(_mm_setr_epi8(8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4));
// look up each half
__m256i clz_hi = _mm256_shuffle_epi8(lookup_hi, hi);
__m256i clz_lo = _mm256_shuffle_epi8(lookup_lo, lo);
// combine results (addition or xor would work as well)
return _mm256_or_si256(clz_hi, clz_lo);
}
godbolt-link with crude test: https://godbolt.org/z/MYq74Wxdh