I am wondering what the best way is to build an __m512i from two __m256i values with simple concatenation (zmm0 = {ymm1, ymm0}). I know ymm0 is the low 256 bits of zmm0, but I'm not sure whether I can take advantage of that in C using intrinsics. What's the best way to achieve this in C?
Strange, there doesn't seem to be a 256->512 version of _mm256_set_m128i in Intel's intrinsics guide. Perhaps because every AVX512 intrinsic has to have a _mask_ version? No, there's still _mm512_set_epi32, so that's odd.
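You can _mm512_castsi256_si512 one to __m512i and vinserti32x8 the other into it. (Or 64x4; the element size makes no difference to the result when not masking, although vinserti32x8 requires AVX512DQ while vinserti64x4 needs only AVX512F.)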
#include <immintrin.h>

__m256i merge256(__m128i lo, __m128i hi){
    // 128->256 has a ready-made intrinsic; note the (hi, lo) argument order
    return _mm256_set_m128i(hi, lo);
}

#ifdef __AVX512F__
__m512i merge512(__m256i lo, __m256i hi){
    __m512i base = _mm512_castsi256_si512(lo);  // upper half is don't-care
    // insert hi as new upper half; 64x4 is AVX512F, _mm512_inserti32x8 would need AVX512DQ
    return _mm512_inserti64x4(base, hi, 1);
    // return _mm512_set_m256i(hi, lo);  // doesn't exist in GCC, clang, ICC, or MSVC
}
#endif
Demo on Godbolt, also including 128->256 with _mm256_set_m128i(hi, lo)
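If you want to convince yourself of the resulting layout, here is a minimal sketch of a byte-level check. The helper names merge512_bytes and merge512_layout_ok are made up for this illustration, and the non-AVX-512 branch is only a scalar stand-in (plain memcpy) so the file still builds when AVX512F isn't enabled; it is not part of the technique itself.

```c
#include <stdint.h>
#include <string.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper: concatenate two 32-byte halves into 64 bytes,
   with lo at bytes 0..31 and hi at bytes 32..63. */
static void merge512_bytes(uint8_t out[64], const uint8_t lo[32], const uint8_t hi[32])
{
#if defined(__AVX512F__)
    __m256i vlo = _mm256_loadu_si256((const __m256i *)lo);
    __m256i vhi = _mm256_loadu_si256((const __m256i *)hi);
    /* cast lo to 512 bits (upper half undefined), then insert hi as the upper half */
    __m512i v = _mm512_inserti64x4(_mm512_castsi256_si512(vlo), vhi, 1);
    _mm512_storeu_si512(out, v);
#else
    /* Scalar stand-in producing the same byte layout when AVX512F is not enabled. */
    memcpy(out, lo, 32);
    memcpy(out + 32, hi, 32);
#endif
}

/* Hypothetical check: returns 1 if byte i of the merged vector equals i for all 64 bytes. */
static int merge512_layout_ok(void)
{
    uint8_t lo[32], hi[32], out[64];
    for (int i = 0; i < 32; i++) {
        lo[i] = (uint8_t)i;         /* bytes 0..31  */
        hi[i] = (uint8_t)(32 + i);  /* bytes 32..63 */
    }
    merge512_bytes(out, lo, hi);
    for (int i = 0; i < 64; i++) {
        if (out[i] != (uint8_t)i)
            return 0;
    }
    return 1;
}
```

Calling merge512_layout_ok() from a test should return 1 if lo really ends up in the low 256 bits and hi in the high 256 bits. Build with something like -mavx512f to exercise the intrinsic path rather than the stand-in.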
I defined the arg order as lo, hi for these examples. You may prefer to define it as hi, lo to match the _mm_set (rather than setr) intrinsics.
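As a rough sketch of the hi, lo convention: the wrapper names my_mm512_set_m256i and my_mm512_setr_m256i below are hypothetical (no compiler provides them), chosen to mirror the existing _mm256_set_m128i and _mm256_setr_m128i. The demo function and its scalar fallback are also just illustration, there so the snippet builds without AVX512F.

```c
#include <stdint.h>
#if defined(__AVX512F__)
#include <immintrin.h>

/* Hypothetical wrapper, set-style order: high half first, like _mm256_set_m128i(hi, lo). */
static inline __m512i my_mm512_set_m256i(__m256i hi, __m256i lo)
{
    return _mm512_inserti64x4(_mm512_castsi256_si512(lo), hi, 1);
}

/* Hypothetical wrapper, setr-style order: low half first, like _mm256_setr_m128i(lo, hi). */
static inline __m512i my_mm512_setr_m256i(__m256i lo, __m256i hi)
{
    return my_mm512_set_m256i(hi, lo);
}
#endif

/* Hypothetical demo: builds set(hi = all 2s, lo = all 1s) and reports
   the lowest and highest 32-bit elements. */
static void set_order_demo(int32_t *first, int32_t *last)
{
    int32_t out[16];
#if defined(__AVX512F__)
    __m512i v = my_mm512_set_m256i(_mm256_set1_epi32(2), _mm256_set1_epi32(1));
    _mm512_storeu_si512(out, v);
#else
    /* Scalar stand-in with the same layout: lo in elements 0..7, hi in 8..15. */
    for (int i = 0; i < 16; i++)
        out[i] = (i < 8) ? 1 : 2;
#endif
    *first = out[0];   /* comes from lo */
    *last  = out[15];  /* comes from hi */
}
```

Either way the generated code should be the same; the only difference is which argument order reads more naturally next to the other set/setr calls in your code.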