c++ simd intrinsics avx avx2

AVX2: CountTrailingZeros on 8 bit elements in AVX register


I would like to have an implementation of a function like _mm256_tzcnt_epi8(__m256i a), where for every 8-bit element the number of trailing zeros is counted and returned.

A previous question on counting leading zeros, "AVX2: BitScanReverse or CountLeadingZeros on 8 bit elements in AVX register", has a solution that uses a lookup table. I wonder whether the same method can be used here.

Only AVX and AVX2, please. The behaviour for an input of 0 can be undefined.

Thanks for your help!


Solution

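  • The same LUT approach as in the answer by chtz to that question should work.

    The saturation trick used there won't work here, but _mm256_blendv_epi8 can be used to select which LUT result to use.

    The low LUT holds the answers for the low-nibble values 0..15. Its entry for 0 is 0xFF: the set top bit makes blendv take the result from the other (high) LUT, which holds 4 plus the trailing-zero count of the upper nibble.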

    Like this (not tested):

    #include <immintrin.h>

    __m256i ctz_epu8(__m256i values)
    {
        // extract upper nibble:
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(values, 4), _mm256_set1_epi8(0xf));
        // extract lower nibble:
        __m256i lo = _mm256_and_si256(values, _mm256_set1_epi8(0xf));
    
                                                                       // 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0 
        const __m256i lookup_hi = _mm256_broadcastsi128_si256(_mm_set_epi8(4, 5, 4, 6, 4, 5, 4, 7, 4, 5, 4, 6, 4, 5, 4, 8));
        
                                                                       // 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
        const __m256i lookup_lo = _mm256_broadcastsi128_si256(_mm_set_epi8(0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, char(0xFF)));
    
        // look up each half
        __m256i ctz_hi = _mm256_shuffle_epi8(lookup_hi, hi);
        __m256i ctz_lo = _mm256_shuffle_epi8(lookup_lo, lo);
    
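        // combine results: blendv takes ctz_hi where the top bit of the mask byte (ctz_lo) is set,
        // i.e. where the low nibble was 0 and lookup_lo returned 0xFF; otherwise it keeps ctz_lo
        return _mm256_blendv_epi8(ctz_lo, ctz_hi, ctz_lo);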
    }
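
Since the snippet above is marked as not tested, one way to gain some confidence in the two tables is a scalar model of a single byte lane, checked against a naive loop for all 256 inputs. This is only a sketch: the names kLookupLo, kLookupHi, ctz8_lut_model, ctz8_reference and count_mismatches are made up for illustration, and the model assumes that _mm256_shuffle_epi8 acts as a plain 16-entry table lookup for indices 0..15 and that _mm256_blendv_epi8 selects its second operand where the top bit of the mask byte is set. It checks the table contents and the blend logic, not the intrinsics themselves.

```cpp
#include <array>
#include <cstdint>

// Hypothetical names throughout; this models one byte lane of ctz_epu8 in scalar code.
// Entry i is the table value for nibble i (_mm_set_epi8 lists the same values from 15 down to 0).
constexpr std::array<std::uint8_t, 16> kLookupLo = {0xFF, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0};
constexpr std::array<std::uint8_t, 16> kLookupHi = {8, 4, 5, 4, 6, 4, 5, 4, 7, 4, 5, 4, 6, 4, 5, 4};

std::uint8_t ctz8_lut_model(std::uint8_t v)
{
    const std::uint8_t ctz_lo = kLookupLo[v & 0x0F]; // models the shuffle on the lower nibble
    const std::uint8_t ctz_hi = kLookupHi[v >> 4];   // models the shuffle on the upper nibble
    // models blendv(ctz_lo, ctz_hi, ctz_lo): top bit of the mask selects the second operand
    return (ctz_lo & 0x80) ? ctz_hi : ctz_lo;
}

// Naive reference: count zero bits from the least significant end, giving 8 for input 0.
std::uint8_t ctz8_reference(std::uint8_t v)
{
    std::uint8_t n = 0;
    while (n < 8 && ((v >> n) & 1u) == 0) {
        ++n;
    }
    return n;
}

// Number of byte values for which the LUT model and the reference disagree.
int count_mismatches()
{
    int mismatches = 0;
    for (int v = 0; v < 256; ++v) {
        const auto b = static_cast<std::uint8_t>(v);
        if (ctz8_lut_model(b) != ctz8_reference(b)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Calling count_mismatches() from a small main should show whether the tables agree with the naive definition. With these tables an input of 0 maps to 8, which is fine given that the question leaves that case undefined.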