I have a float array containing four floats A, B, C, D, and I want to load them into a __m256 variable as AABBCCDD. What's the best way to do this?
I know _mm256_set_ps() is always an option, but it seems slow, compiling to 8 CPU instructions. Thanks.
If your data was the result of another vector calculation (and in a __m128), you'd want AVX2 vpermps (_mm256_permutevar8x32_ps) with a control vector of _mm256_set_epi32(3,3, 2,2, 1,1, 0,0).
vpermps ymm is 1 uop on Intel, but 2 uops on Zen2 (with 2-cycle throughput), and 3 uops on Zen1 with one per 4 clocks throughput. (https://uops.info/)
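One way to spell that with intrinsics, as a minimal sketch (assumes AVX2; the function name is mine, just for illustration). The cast only widens the register; its undefined upper lane is never selected because every index in the control vector is 0..3.
__m256 dup_from_vec(__m128 abcd) {
    __m256 v = _mm256_castps128_ps256(abcd);   // widen; upper lane undefined but never indexed
    return _mm256_permutevar8x32_ps(v, _mm256_set_epi32(3,3, 2,2, 1,1, 0,0));  // vpermps
}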
If it was the result of separate scalar calculations, you might want to shuffle them together with _mm_set_ps(d,d, c,c) (1x vshufps) to set up for a vinsertf128.
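A minimal sketch of that scalar case (AVX1 only; the function and variable names are mine, just for illustration):
__m256 dup_from_scalars(float a, float b, float c, float d) {
    __m128 lo = _mm_set_ps(b,b, a,a);   // typically one shufps: A A B B
    __m128 hi = _mm_set_ps(d,d, c,c);   // typically one shufps: C C D D
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);  // vinsertf128
}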
But with data in memory, I think your best bet is a 128-bit broadcast-load, then an in-lane shuffle. It only requires AVX1, and on modern CPUs it's 1 load + 1 shuffle uop on Zen2 and on Haswell and later. It's also efficient on Zen1: the only lane-crossing shuffle is the 128-bit broadcast-load.
Using an in-lane shuffle is lower latency than a lane-crossing shuffle on both Intel and Zen2 (256-bit shuffle execution units). This still requires a 32-byte shuffle-control vector constant, but if you need to do this frequently it will typically / hopefully stay hot in cache.
#include <immintrin.h>

__m256 duplicate4floats(const void *p) {
    __m256 v = _mm256_broadcast_ps((const __m128 *) p);                 // vbroadcastf128: A B C D | A B C D
    v = _mm256_permutevar_ps(v, _mm256_set_epi32(3,3, 2,2, 1,1, 0,0));  // vpermilps (in-lane): A A B B | C C D D
    return v;
}
Modern CPUs handle broadcast-loads right in the load port, no shuffle uop needed. (Sandybridge does need a port-5 shuffle uop for vbroadcastf128, unlike narrower broadcasts, but Haswell and later are purely port 2/3. And SnB doesn't support AVX2 anyway, so a lane-crossing shuffle with granularity finer than 128 bits wasn't an option there.)
So even if AVX2 is available, I think the AVX1 instructions are more efficient here. On Zen1, vbroadcastf128 is 2 uops vs. 1 for a 128-bit vmovups, but vpermps (lane-crossing) is 3 uops vs. 2 for vpermilps.
Unfortunately, clang pessimizes this into a vmovups load and a vpermps ymm, but GCC compiles it as written. (Godbolt)
If you wanted to avoid using a shuffle-control vector constant, vpmovzxdq ymm, [mem] (2 uops on Intel) could get the elements set up for vmovsldup (a 1-uop in-lane shuffle). Or broadcast-load and vunpckl/hps, then blend?
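Here's a minimal sketch of that vpmovzxdq + vmovsldup idea (my rendering, assuming AVX2 for the ymm zero-extending load; whether the load actually folds into vpmovzxdq is up to the compiler):
__m256 duplicate4floats_zx(const void *p) {
    // vpmovzxdq: each float's 32 bits land in the low half of a 64-bit element, zeros above
    __m256i zx = _mm256_cvtepu32_epi64(_mm_loadu_si128((const __m128i *) p));
    // vmovsldup: duplicate the even (low) element of each pair -> A A B B C C D D
    return _mm256_moveldup_ps(_mm256_castsi256_ps(zx));
}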
I know using _mm256_set_ps() is always an option but it seems slow with 8 CPU instructions.
Get a better compiler, then! (Or remember to enable optimization.)
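For example, something along these lines (illustrative command line; substitute whatever -mavx / -march option matches your target) is all it takes for _mm256_set_ps to compile into a few shuffles instead of a long sequence of scalar inserts:
gcc -O3 -mavx -c dup4.c     # or clang; -march=haswell / -march=native also enable AVX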
__m256 duplicate4floats_naive(const float *p) {
    return _mm256_set_ps(p[3],p[3], p[2],p[2], p[1],p[1], p[0],p[0]);
}
compiles with gcc (https://godbolt.org/z/dMzh3fezE) into
duplicate4floats_naive(float const*):
        vmovups     xmm1, XMMWORD PTR [rdi]
        vpermilps   xmm0, xmm1, 80
        vpermilps   xmm1, xmm1, 250
        vinsertf128 ymm0, ymm0, xmm1, 0x1
        ret
So 3 shuffle uops, not great. And it could have used vshufps instead of vpermilps to save code-size and let it run on more ports on Ice Lake. But it's still vastly better than 8 instructions.
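If you wanted to write that strategy out by hand with intrinsics, here's a rough sketch of the same shuffles gcc picked (AVX1-only, no vector constant; written as _mm_shuffle_ps, though the compiler may still choose vpermilps since both source operands are the same register):
__m256 duplicate4floats_shuffles(const float *p) {
    __m128 v  = _mm_loadu_ps(p);
    __m128 lo = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1,1,0,0));  // imm 0x50 == 80:  A A B B
    __m128 hi = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3,3,2,2));  // imm 0xFA == 250: C C D D
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);  // vinsertf128
}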
From the naive _mm256_set_ps source, clang's shuffle optimizer makes the same asm as with my optimized intrinsics, because that's how clang is. It's pretty decent optimization, just not quite optimal.
duplicate4floats_naive(float const*):
        vmovups xmm0, xmmword ptr [rdi]
        vmovaps ymm1, ymmword ptr [rip + .LCPI1_0]   # ymm1 = [0,0,1,1,2,2,3,3]
        vpermps ymm0, ymm1, ymm0
        ret