c intrinsics avx2 avx512

Fill a zmm from two ymms in C


I am wondering what the best way is to build an __m512i from two __m256i values with simple packing (zmm0 = {ymm1, ymm0}). I know ymm0 is the low 256 bits of zmm0, but I'm not sure whether I can take advantage of that in C using intrinsics. What's the best way to achieve this in C?


Solution

  • Strange, there doesn't seem to be a 256->512 version of _mm256_set_m128i in Intel's intrinsics guide. Perhaps because every AVX512 intrinsic has to have a _mask_ version? No, there's still _mm512_set_epi32, so that's odd.

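    You can _mm512_cast one to __m512i and insert the other into it with vinserti64x4 or vinserti32x8. The choice doesn't affect the result when not masking, but vinserti32x8 requires AVX512DQ, whereas vinserti64x4 only needs AVX512F.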

    #include <immintrin.h>
    
    // 128 -> 256: a dedicated intrinsic exists (requires AVX)
    __m256i merge256(__m128i lo, __m128i hi){
        return _mm256_set_m128i(hi, lo);
    }
    
    #ifdef __AVX512F__
    __m512i merge512(__m256i lo, __m256i hi){
        __m512i base = _mm512_castsi256_si512(lo);  // upper half is don't-care
        // insert hi as new upper half; 64x4 only needs AVX512F (32x8 would need AVX512DQ)
        return _mm512_inserti64x4(base, hi, 1);
        
        // return _mm512_set_m256i(hi, lo);  // doesn't exist in GCC, clang, ICC, or MSVC
    }
    #endif
    #endif
    

    Demo on Godbolt, also including 128->256 with _mm256_set_m128i(hi, lo)

    I defined the arg order as lo, hi for these examples. You may prefer to define it as hi, lo to match the _mm_set (rather than setr) intrinsics.
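
To sanity-check the lane order of the cast-plus-insert approach, a minimal sketch along these lines may help. It is not part of the original answer: the names `merge512_store` and `check_merge512` are hypothetical, and the code assumes GCC or clang, since it relies on `__attribute__((target("avx512f")))` and `__builtin_cpu_supports` so that it can be built without `-mavx512f` and skip the check on CPUs that lack AVX-512F.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper (not from the answer): packs {hi, lo} into one
 * 512-bit vector and stores its 16 dwords to out.
 * The target attribute is a GCC/clang extension (assumption). */
__attribute__((target("avx512f")))
void merge512_store(const int32_t lo[8], const int32_t hi[8], int32_t out[16])
{
    __m256i vlo = _mm256_loadu_si256((const __m256i *)lo);
    __m256i vhi = _mm256_loadu_si256((const __m256i *)hi);

    __m512i base = _mm512_castsi256_si512(vlo);   /* upper 256 bits undefined */
    __m512i v = _mm512_inserti64x4(base, vhi, 1); /* overwrite upper 256 bits */

    _mm512_storeu_si512(out, v);
}

/* Returns 1 if lo landed in elements 0..7 and hi in elements 8..15,
 * 0 if the layout is different, -1 if the CPU lacks AVX-512F. */
int check_merge512(void)
{
    if (!__builtin_cpu_supports("avx512f"))
        return -1;  /* cannot run the 512-bit code on this machine */

    int32_t lo[8], hi[8], out[16];
    for (int i = 0; i < 8; i++) {
        lo[i] = i;      /* 0..7  */
        hi[i] = 8 + i;  /* 8..15 */
    }

    merge512_store(lo, hi, out);

    for (int i = 0; i < 16; i++)
        if (out[i] != i)
            return 0;
    return 1;
}
```

Calling `check_merge512()` from `main` and printing the result is enough to confirm the layout on a given machine. The helper takes pointers rather than `__m512i` arguments so that the caller does not itself need to be compiled with AVX-512 enabled.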