Search code examples
c++simdintrinsicsavxavx2

AVX2: BitScanReverse or CountLeadingZeros on 8 bit elements in AVX register


I would like to extract the index of the highest set bit in a 256 bit AVX register with 8 bit elements. I could neither find a bsr nor a clz implementation for this.

For clz with 32 bit elements, there is the bithack with a float conversion, but that is probably not possible for 8 bits.

Currently, I am working on a solution, where I check the bits one by one, which I will add later, but I wonder if there is a faster way to do this.


Solution

  • Here is a vpshufb based solution. The idea is to split the input into two halves, make a lookup on both and combine the results:

    __m256i clz_epu8(__m256i values)
    {
        // extract upper nibble:
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(values, 4), _mm256_set1_epi8(0xf));
        // this sets the highest bit for values >= 0x10 and otherwise keeps the lower nibble unmodified:
        __m256i lo = _mm256_adds_epu8(values, _mm256_set1_epi8(0x70));
    
        // lookup tables for count-leading-zeros (replace this by _mm256_setr_epi8, if this does not get optimized away)
        // ideally, this should compile to vbroadcastf128 ...
        const __m256i lookup_hi = _mm256_broadcastsi128_si256(_mm_setr_epi8(0, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0));
        const __m256i lookup_lo = _mm256_broadcastsi128_si256(_mm_setr_epi8(8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4));
    
        // look up each half
        __m256i clz_hi = _mm256_shuffle_epi8(lookup_hi, hi);
        __m256i clz_lo = _mm256_shuffle_epi8(lookup_lo, lo);
    
        // combine results (addition or xor would work as well)
        return _mm256_or_si256(clz_hi, clz_lo);
    }
    

    godbolt-link with crude test: https://godbolt.org/z/MYq74Wxdh