Tags: simd, sse, avx

How to convert int64 to int32 with AVX (but without AVX-512)


I'd like to convert / pack a register of 4 longs (64 bits) to 4 ints (32 bits). In other words, to convert a __m256i of int64 to a __m128i of int32.

I don't have AVX-512 at my disposal, so the intrinsic:

__m256i n;
__m128i r128i = _mm256_cvtepi64_epi32( n );

is not available for me. Any alternatives better than the ones below?

Losing vectorization:

__m256i n;
alignas(32) int64_t temp[4];
_mm256_store_si256((__m256i*)temp, n);
int32_t a = (int32_t)temp[0];
int32_t b = (int32_t)temp[1];
int32_t c = (int32_t)temp[2];
int32_t d = (int32_t)temp[3];
// _mm_set_epi32 takes its arguments from the highest element down to the lowest,
// so pass d..a to keep the elements in their original order
__m128i r128i = _mm_set_epi32(d, c, b, a);

This packs into 16-bit integers instead of 32-bit:

__m128i lo_lane = _mm256_castsi256_si128(n);
__m128i hi_lane = _mm256_extracti128_si256(n, 1);
// _mm_packus_epi32 packs 32-bit -> 16-bit with unsigned saturation, not 64 -> 32
__m128i r128i = _mm_packus_epi32(lo_lane, hi_lane);

Solution

  • So just truncation, not signed (or unsigned) saturation? (I ask because AVX-512 provides signed and unsigned saturation versions, as well as truncation; they're listed just below for reference. The non-AVX512 packs like _mm_packus_epi32 (packusdw) you were using always do saturation; you have to use plain shuffle instructions if you want packing with truncation before AVX-512. But if either is fine because the upper half is known zero, then yeah the pack instructions can be useful.)
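
    For reference (not usable here, per the question), the AVX-512 + AVX-512VL forms of this conversion are:

    // __m128i _mm256_cvtepi64_epi32(__m256i a);    // vpmovqd   - truncation
    // __m128i _mm256_cvtsepi64_epi32(__m256i a);   // vpmovsqd  - signed saturation
    // __m128i _mm256_cvtusepi64_epi32(__m256i a);  // vpmovusqd - unsigned saturation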


    Single vector __m256i -> __m128i

    For a single vector, producing a narrower output, you could use vextracti128 + vshufps to pack a single __m256i into a __m128i. Before AVX-512, vshufps is one of the few 2-input shuffles that has any control input at all, rather than just a fixed interleave.

    In C with intrinsics, you'd need _mm_castsi128_ps and back to keep the compiler happy about using _mm_shuffle_ps on integer vectors, but modern CPUs don't have bypass delays for using FP shuffles between integer SIMD instructions. Or if you're just going to store the result, you can leave it as __m128 and use _mm_store_ps((float*)p, vec); (And yes, it's still strict-aliasing safe to cast integer pointers to float* because the dereference happens inside the intrinsic, not in pure C.)

    #include <immintrin.h>
    
    __m128 cvtepi64_epi32_avx(__m256i v)
    {
       __m256 vf = _mm256_castsi256_ps( v );      // free
       __m128 hi = _mm256_extractf128_ps(vf, 1);  // vextractf128
       __m128 lo = _mm256_castps256_ps128( vf );  // also free
       // take the bottom 32 bits of each 64-bit chunk in lo and hi
       __m128 packed = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0));  // shufps
       //return _mm_castps_si128(packed);  // if you want
       return packed;
    }   
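
    As mentioned above, if you're only going to store the result, you can use the __m128 return value directly with an FP store. A minimal usage sketch of that, calling the function above (the function name store_low32 and the aligned-destination assumption are mine, not from the original):

    #include <stdint.h>

    // 'out' must point to 4 int32_t, 16-byte aligned for _mm_store_ps
    void store_low32(int32_t *out, __m256i v)
    {
       // strict-aliasing safe: the dereference happens inside the intrinsic
       _mm_store_ps((float *)out, cvtepi64_epi32_avx(v));   // vmovaps
    }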
    

    This is 2 shuffles per 128-bit of output data. We can do better: 2 shuffles per 256-bit of output data. (Or even just 1 if we can have our input arranged nicely).


    2x __m256i inputs producing a __m256i output

    Fortunately clang spotted a better optimization than I did. I thought 2x vpermd + vpblendd could do it, shuffling the low 32 bits of each element in one vector to the bottom lane, or to the top lane in the other. (With a set_epi32(6,4,2,0, 6,4,2,0) shuffle control; see the sketch just below.)
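
    That vpermd + vpblendd idea would look something like this (my translation of the idea into intrinsics, not code from the original answer); note that it needs a vector constant for the shuffle control, unlike the version that follows:

    // 2x 256 -> 1x 256-bit result, via 2x vpermd + vpblendd
    __m256i pack64to32_vpermd(__m256i a, __m256i b)
    {
        // gather the low dword of each qword into both 128-bit lanes
        __m256i ctrl = _mm256_set_epi32(6,4,2,0, 6,4,2,0);
        __m256i lo_a = _mm256_permutevar8x32_epi32(a, ctrl);   // vpermd
        __m256i lo_b = _mm256_permutevar8x32_epi32(b, ctrl);   // vpermd
        // keep a's packed dwords in the low lane, b's in the high lane
        return _mm256_blend_epi32(lo_a, lo_b, 0xF0);           // vpblendd
    }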

    But clang optimized that into vshufps to get all the elements we want into one vector, then vpermpd (equivalent to vpermq) to get them into the correct order. (This is generally a good strategy, and I should have thought of that myself. :P Again, it's taking advantage of vshufps as a 2-input shuffle.)

    Translating that back into intrinsics, we get code that will compile to that efficient asm for GCC or other compilers (Godbolt compiler explorer for this and the original):

    // 2x 256 -> 1x 256-bit result
    __m256i pack64to32(__m256i a, __m256i b)
    {
        // grab the 32-bit low halves of 64-bit elements into one vector
        __m256 combined = _mm256_shuffle_ps(_mm256_castsi256_ps(a),
                                            _mm256_castsi256_ps(b), _MM_SHUFFLE(2,0,2,0));
        // {b3,b2, a3,a2 | b1,b0, a1,a0}  from high to low
    
        // re-arrange pairs of 32-bit elements with vpermpd (or vpermq if you want)
        __m256d ordered = _mm256_permute4x64_pd(_mm256_castps_pd(combined), _MM_SHUFFLE(3,1,2,0));
        return _mm256_castpd_si256(ordered);
    }
    

    It compiles to just 2 instructions with immediate shuffle controls, no vector constants. The source looks verbose but it's mostly just casts to keep the compiler happy about types.

    # clang -O3 -march=haswell
    pack64to32:                             # @pack64to32
            vshufps ymm0, ymm0, ymm1, 136           # ymm0 = ymm0[0,2],ymm1[0,2],ymm0[4,6],ymm1[4,6]
            vpermpd ymm0, ymm0, 216                 # ymm0 = ymm0[0,2,1,3]
            ret
    

    With input reordering to avoid lane-crossing: one vshufps

    If you can arrange pairs of input vectors so they have 64-bit elements in {a0, a1 | a4, a5} and {a2, a3 | a6, a7} order, you only need an in-lane shuffle: the low 4x 32-bit elements come from the low halves of each 256-bit input, etc. You can get the job done with one _mm256_shuffle_ps, exactly as above but without the _mm256_permute4x64_pd; see the sketch below. Credit to @chtz in comments under the question for this suggestion.
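
    Under that input arrangement, a sketch (the function name is mine) could be as simple as:

    // va holds 64-bit elements {a0,a1 | a4,a5}, vb holds {a2,a3 | a6,a7}
    __m256i pack64to32_prearranged(__m256i va, __m256i vb)
    {
        __m256 packed = _mm256_shuffle_ps(_mm256_castsi256_ps(va),
                                          _mm256_castsi256_ps(vb), _MM_SHUFFLE(2,0,2,0));
        return _mm256_castps_si256(packed);   // {a0..a3 | a4..a7} as 32-bit elements
    }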

    Mandelbrot doesn't need an interaction between elements, so you could probably use pairs of __m256i vectors with that arrangement of 64-bit elements.

    If you're starting with an unrolled loop with counters like {0,1,2,3} and {4,5,6,7} and using _mm256_add_epi64 with set1_epi64x(8) to increment, you could instead start with {0,1,4,5} and {2,3,6,7} and everything should work identically, as in the sketch below. (Unless you're doing something else where the order of your elements matters?)
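
    For example (the counter names and the unroll-by-2 structure are just illustrative assumptions):

    __m256i ctr_a = _mm256_set_epi64x(5, 4, 1, 0);    // 64-bit elements {0,1 | 4,5}, low to high
    __m256i ctr_b = _mm256_set_epi64x(7, 6, 3, 2);    // 64-bit elements {2,3 | 6,7}
    const __m256i step = _mm256_set1_epi64x(8);
    // per iteration:
    ctr_a = _mm256_add_epi64(ctr_a, step);
    ctr_b = _mm256_add_epi64(ctr_b, step);
    // the low 32 bits of both counters pack with the single vshufps shown above
    __m256i packed = pack64to32_prearranged(ctr_a, ctr_b);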


    Asm vs. intrinsic names

    Consult Intel's intrinsics guide, which you can search by asm mnemonic; mnemonics are shorter to type and easier to think about in terms of what the machine can actually do, and they're what you need when looking up performance data in https://uops.info/table.html or https://agner.org/optimize/.