Search code examples
x86simdavxmicro-optimizationavx2

Is using AVX2 can implement a faster processing of LZCNT on a word array?


I need to bit scan reverse with LZCNT an array of words: 16 bits.

The throughput of LZCNT is 1 execution per clock on an Intel latest generation processors. The throughput on an AMD Ryzen seems to be 4.

I am trying to find an algorithm using the AVX2 instruction set to be faster.

I know AVX-512 has VPLZCNTD for 32-bit elements, so if I had AVX512CD I could unpack and use that.

With just the AVX2 instruction set, it is possible to code an algorithm faster than using the x86 asm LZCNT instruction?


Solution

  • #include <immintrin.h>
    
    __m256i avx2_lzcnt_epi16(__m256i v) {
        const __m256i lut_lo = _mm256_set_epi8(
            4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 16,
            4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 16
        );
        const __m256i lut_hi = _mm256_set_epi8(
            0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 16,
            0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 16
        );
        const __m256i nibble_mask = _mm256_set1_epi8(0x0F);
        const __m256i byte_offset = _mm256_set1_epi16(0x0008);
        __m256i t;
    
        t = _mm256_and_si256(nibble_mask, v);
        v = _mm256_and_si256(_mm256_srli_epi16(v, 4), nibble_mask);
        t = _mm256_shuffle_epi8(lut_lo, t);
        v = _mm256_shuffle_epi8(lut_hi, v);
        v = _mm256_min_epu8(v, t);
    
        t = _mm256_srli_epi16(v, 8);
        v = _mm256_or_si256(v, byte_offset);
        v = _mm256_min_epu8(v, t);
    
        return v;
    }
    
    // 16 - lzcnt_u16(subwords)
    __m256i avx2_ms1b_epi16(__m256i v) {
        const __m256i lut_lo = _mm256_set_epi8(
            12, 12, 12, 12, 12, 12, 12, 12, 11, 11, 11, 11, 10, 10, 9, 0,
            12, 12, 12, 12, 12, 12, 12, 12, 11, 11, 11, 11, 10, 10, 9, 0
        );
        const __m256i lut_hi = _mm256_set_epi8(
            16, 16, 16, 16, 16, 16, 16, 16, 15, 15, 15, 15, 14, 14, 13, 0,
            16, 16, 16, 16, 16, 16, 16, 16, 15, 15, 15, 15, 14, 14, 13, 0
        );
        const __m256i nibble_mask = _mm256_set1_epi8(0x0F);
        const __m256i adj = _mm256_set1_epi16(0x1F08);
        __m256i t;
    
        t = _mm256_and_si256(nibble_mask, v);
        v = _mm256_and_si256(_mm256_srli_epi16(v, 4), nibble_mask);
        t = _mm256_shuffle_epi8(lut_lo, t);
        v = _mm256_shuffle_epi8(lut_hi, v);
        v = _mm256_max_epu8(v, t);
    
        t = _mm256_srli_epi16(v, 8);
        v = _mm256_sub_epi8(v, adj);
        v = _mm256_max_epi8(v, t);
    
        return v;
    }
    

    For results packed into uint8 use _mm256_packs_epi16(). For packed results in the correct order also use _mm256_permute4x64_epi64().

    Solution from r/SIMD. This solution was also described in the comments here.