Tags: c++, avx

Load and duplicate 4 single precision float numbers into a packed __m256 variable with fewest instructions


I have a float array containing four floats A, B, C, D, and I wish to load them into a __m256 variable as AABBCCDD. What's the best way to do this? I know using _mm256_set_ps() is always an option but it seems slow with 8 CPU instructions. Thanks.


Solution

  • If your data was the result of another vector calculation (and in a __m128), you'd want AVX2 vpermps (_mm256_permutevar8x32_ps) with a control vector of _mm256_set_epi32(3,3, 2,2, 1,1, 0,0).

    vpermps ymm is 1 uop on Intel, but 2 uops on Zen2 (with 2 cycle throughput). And 3 uops on Zen1 with one per 4 clock throughput. (https://uops.info/)
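
    For example, a minimal sketch of that case (the helper name dup4_from_vec is mine, and it assumes AVX2): the __m128 is widened with a cost-free cast, and since every permute index is in 0..3, the undefined upper lane is never selected.

    __m256 dup4_from_vec(__m128 v) {
       __m256 wide = _mm256_castps128_ps256(v);   // no instruction; upper lane undefined
       return _mm256_permutevar8x32_ps(wide,      // vpermps (AVX2), lane-crossing
                  _mm256_set_epi32(3,3, 2,2, 1,1, 0,0));
    }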

    If it was the result of separate scalar calculations, you might want to shuffle them together with _mm_set_ps(d,d, c,c) (1x vshufps) to set up for a vinsertf128.
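
    A sketch of that scalar case (the helper name dup4_from_scalars is mine; the exact instruction choice is up to the compiler):

    __m256 dup4_from_scalars(float a, float b, float c, float d) {
       __m128 lo = _mm_set_ps(b, b, a, a);   // typically one vshufps / vunpcklps
       __m128 hi = _mm_set_ps(d, d, c, c);   // typically one vshufps / vunpcklps
       return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);  // vinsertf128
    }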


    But with data in memory, I think your best bet is a 128-bit broadcast-load, then an in-lane shuffle. It only requires AVX1, and it's just 1 load + 1 shuffle uop on modern CPUs (Haswell and later, and Zen2). It's also efficient on Zen1, where the only lane-crossing shuffle is the 128-bit broadcast-load.

    Using an in-lane shuffle is lower-latency than a lane-crossing one on both Intel and Zen2 (CPUs with 256-bit-wide shuffle execution units). This still requires a 32-byte shuffle-control vector constant, but if you need to do this frequently it will typically / hopefully stay hot in cache.

    #include <immintrin.h>

    __m256  duplicate4floats(void *p) {
       __m256 v = _mm256_broadcast_ps((const __m128 *) p);   // vbroadcastf128: ABCD|ABCD
       v = _mm256_permutevar_ps(v, _mm256_set_epi32(3,3, 2,2,  1,1, 0,0));  // vpermilps: AABB|CCDD
       return v;
    }
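
    A quick usage sketch (the array contents are made up for illustration):

    float src[4] = {1.0f, 2.0f, 3.0f, 4.0f};   // A, B, C, D
    __m256 dup = duplicate4floats(src);        // = {1,1, 2,2, 3,3, 4,4}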
    

    Modern CPUs handle broadcast-loads right in the load port, no shuffle uop needed. (Sandybridge does need a port 5 shuffle uop for vbroadcastf128, unlike narrower broadcasts, but Haswell and later are purely port 2/3. But SnB doesn't support AVX2 so a lane-crossing shuffle with granularity less than 128-bit wasn't an option.)

    So even if AVX2 is available, I think AVX1 instructions are more efficient here. On Zen1, vbroadcastf128 is 2 uops, vs. 1 for a 128-bit vmovups, but vpermps (lane-crossing) is 3 uops vs. 2 for vpermilps.

    Unfortunately, clang pessimizes this into a vmovups load and a vpermps ymm, but GCC compiles it as written. (Godbolt)


    If you wanted to avoid using a shuffle-control vector constant, vpmovzxdq ymm, [mem] (2 uops on Intel) could get the elements set up for vmovsldup (1 uop, an in-lane shuffle). Or broadcast-load and vunpcklps / vunpckhps, then blend?
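
    A sketch of that constant-free idea, assuming AVX2 for the zero-extending load (the helper name is mine, and whether the load actually folds into vpmovzxdq is up to the compiler):

    __m256 dup4_no_constant(const float *p) {
       // vpmovzxdq: A,0, B,0, C,0, D,0  (float bit-patterns zero-extended to qwords)
       __m256i zx = _mm256_cvtepu32_epi64(_mm_loadu_si128((const __m128i *) p));
       // vmovsldup duplicates the even elements -> A,A, B,B, C,C, D,D
       return _mm256_moveldup_ps(_mm256_castsi256_ps(zx));
    }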


    "I know using _mm256_set_ps() is always an option but it seems slow with 8 CPU instructions."

    Get a better compiler, then! (Or remember to enable optimization.)

    __m256  duplicate4floats_naive(const float *p) {
       return _mm256_set_ps(p[3],p[3], p[2], p[2], p[1],p[1], p[0],p[0]);
    }
    

    compiles with gcc (https://godbolt.org/z/dMzh3fezE) into

    duplicate4floats_naive(float const*):
            vmovups xmm1, XMMWORD PTR [rdi]
            vpermilps       xmm0, xmm1, 80
            vpermilps       xmm1, xmm1, 250
            vinsertf128     ymm0, ymm0, xmm1, 0x1
            ret
    

    So 3 shuffle uops, not great. And it could have used vshufps instead of vpermilps to save code-size and let it run on more ports on Ice Lake. But still vastly better than 8 instructions.

    clang's shuffle optimizer turns this into the same asm as it made from my optimized intrinsics, because that's how clang handles shuffle intrinsics. It's pretty decent optimization, just not quite optimal.

    duplicate4floats_naive(float const*):
            vmovups xmm0, xmmword ptr [rdi]
            vmovaps ymm1, ymmword ptr [rip + .LCPI1_0] # ymm1 = [0,0,1,1,2,2,3,3]
            vpermps ymm0, ymm1, ymm0
            ret