since there is no AVX version of _mm_movelh_ps
I usually used _mm256_shuffle_ps(a, b, 0x44)
for AVX registers as a replacement. However, I remember reading in other questions, that swizzle instructions without a control integer (like _mm256_unpacklo_ps
or _mm_movelh_ps
) should be preferred if possible (for some reason I don't know). Yesterday, it occurred to me, that another alternative might be using the following:
_mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(a), _mm256_castps_pd(b)));
Since the casts are supposed to be no-ops, is this better\equal\worse than using _mm256_shuffle_ps
regarding performance?
Also, if it is truly the case, it would be nice if somebody could explain in simple words (I have very limited understanding of assembly and microarchitecture) why one should prefer instructions without a control integer.
Thanks in advance
Additional note:
Clang actually optimizes the shuffle to vunpcklpd
: https://godbolt.org/z/9XFP8D
So it seems that my idea is not too bad. However, GCC and ICC create a shuffle instruction.
Avoiding an immediate saves 1 byte of machine-code size; that's all. It's at the bottom of the list for performance considerations, but all else equal shuffles like _mm256_unpacklo_pd
with an implicit "control" are very slightly better than an immediate control byte for that reason.
(But taking the control operand in another vector like vpermilps
can or vpermd
requires is usually worse, unless you have some weird front-end bottleneck in a long-running loop, and can load the shuffle control outside the loop. Not very plausible and at this point you'd have to be writing by hand in asm to be caring that much about code size/alignment; in C++ that's still not something you can really control directly.)
Since the casts are supposed to be no-ops, is this better\equal\worse than using
_mm256_shuffle_ps
regarding performance?
Ice Lake has 2/clock vshufps
vs. 1/clock vunpcklpd
, according to testing by uops.info on real hardware, running on port 1 or port 5. Definitely use _mm256_shuffle_ps
. The trivial extra code-size cost probably doesn't actually hurt at all on earlier CPUs, and is probably worth it for the future benefit on ICL, unless you're sure that port 5 won't be a bottleneck.
Ice Lake has a 2nd shuffle unit on port 1 that can handle some common XMM and in-lane YMM shuffles, including vpshufb
and apparently some 2-input shuffles like vshufps
. I have no idea why it doesn't just decode vunpcklpd
as a vshufps
with that control vector, or otherwise manage to run that shuffle on port 1. We know the shuffle HW itself can do the shuffle so I guess it's just a matter of control hardware to set up implicit shuffles, mapping an opcode to a shuffle control somehow.
Other than that, it's equal or better on older AVX CPUs; no CPUs have penalties for using PD shuffles between other PS instructions. The only different on any existing CPUs is code-size. Old CPUs like K8 and Core 2 had faster pd
shuffles than ps
, but no CPUs with AVX have shuffle units with that weakness. Also, AVX non-destructive instructions level differences between which operand has to be the destination.
As you can see from the Godbolt link, there are zero extra instructions before/after the shuffle. The "cast" intrinsics aren't doing conversion, just reinterpret to keep the C++ type system happy because Intel decided to have separate types for __m256
vs. __m256d
(vs. __m256i
), instead of having one generic YMM type. They chose not to have separate uint8x16
vs. uint32x4
vectors the way ARM did, though; for integer SIMD just __m256i
.
So there's no need for compilers to emit extra instructions for casts, and in practice that's true; they don't introduce extra vmovaps
/apd
register copies or anything like that.
If you're using clang you can just write it conveniently and let clang's shuffle optimizer emit vunpcklpd
for you. Or in other cases, do whatever it's going to do anyway; sometimes it makes worse choices than the source, often it does a good job.
Clang gets this wrong with -march=icelake-client
, still using vunpcklpd
even if you write _mm256_shuffle_ps
. (Or depending on surrounding code, might optimize that shuffle into part of something else.)