c++ simd sse avx sse2

AVX divide __m256i packed 32-bit integers by two (no AVX2)


I'm looking for the fastest way to divide an __m256i of packed 32-bit integers by two (aka shift right by one) using AVX. I don't have access to AVX2. As far as I know, my options are:

  1. Drop down to SSE2
  2. Something like "AVX __m256i integer division for signed 32-bit elements"

In case I need to go down to SSE2 I'd appreciate the best SSE2 implementation. In case it's 2), I'd like to know the intrinsics to use and also if there's a more optimized implementation for specifically dividing by 2. Thanks!


Solution

  • Assuming you know what you’re doing, here’s that function.

    #include <immintrin.h> // AVX intrinsics; build with AVX enabled (e.g. -mavx or /arch:AVX)

    inline __m256i div2_epi32( __m256i vec )
    {
        // Split the 32-byte vector into 16-byte ones
        __m128i low = _mm256_castsi256_si128( vec );
        __m128i high = _mm256_extractf128_si256( vec, 1 );
        // Arithmetic shift of each 32-bit lane by 1; for negative odd values this rounds
        // toward negative infinity, unlike C++ `/ 2`. Use _mm_srli_epi32 for the unsigned version
        low = _mm_srai_epi32( low, 1 );
        high = _mm_srai_epi32( high, 1 );
        // Combine back into 32-byte vector
        vec = _mm256_castsi128_si256( low );
        return _mm256_insertf128_si256( vec, high, 1 );
    }
    

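    However, doing that is not necessarily faster than working with 16-byte vectors throughout. On most CPUs the insert/extract instructions (vinsertf128 / vextractf128) are relatively slow because they move data across the 128-bit lane boundary. AMD Zen 1 may be an exception, since it handles 32-byte vectors as two 16-byte halves internally.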
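
If staying with 16-byte vectors turns out to be the better option, a plain SSE2 version might look like the sketch below. It is only an illustration: the names `div2_floor_epi32`, `div2_trunc_epi32` and `div2_trunc_array` are made up for this example, and it assumes the data sits in an ordinary `int32_t` array. It also shows one way to get the same rounding as C++ `/ 2` (toward zero), which a bare arithmetic shift does not give for negative odd values: add the sign bit before shifting.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical helper: arithmetic shift right by one.
// Rounds toward negative infinity, e.g. -7 becomes -4.
inline __m128i div2_floor_epi32( __m128i v )
{
    return _mm_srai_epi32( v, 1 );
}

// Hypothetical helper: division by two that rounds toward zero, like C++ `/ 2`.
// The logical shift by 31 yields 1 for negative lanes and 0 otherwise;
// adding that bias before the arithmetic shift fixes up negative odd values.
inline __m128i div2_trunc_epi32( __m128i v )
{
    const __m128i bias = _mm_srli_epi32( v, 31 );
    return _mm_srai_epi32( _mm_add_epi32( v, bias ), 1 );
}

// Hypothetical helper: halve every element of an array in place, rounding toward zero.
// Uses unaligned loads/stores, so `data` needs no particular alignment.
inline void div2_trunc_array( std::int32_t* data, std::size_t count )
{
    std::size_t i = 0;
    // Main loop: 4 lanes per iteration
    for( ; i + 4 <= count; i += 4 )
    {
        __m128i v = _mm_loadu_si128( reinterpret_cast<const __m128i*>( data + i ) );
        v = div2_trunc_epi32( v );
        _mm_storeu_si128( reinterpret_cast<__m128i*>( data + i ), v );
    }
    // Scalar tail for the remaining 0..3 elements
    for( ; i < count; ++i )
        data[ i ] /= 2;
}
```

For example, calling `div2_trunc_array` on `{ 7, -7, 8, -8 }` leaves `{ 3, -3, 4, -4 }` in the array, while `div2_floor_epi32` on the same values gives `{ 3, -4, 4, -4 }`. If flooring is acceptable, the single shift is cheaper; the truncating variant costs two extra instructions per vector.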