What is the most efficient way in Intel x86 assembly to do the following operation (a, b are 32-bit floats):
From xmm1: [-, -, a, b]
to xmm1: [a, a, b, b]
I could not find any useful instructions.
My idea is to copy a and b to other registers, then shift the xmm1
register by 4 bytes and move a or b into the lowest 4 bytes.
You're looking for unpcklps xmm1, xmm1
(https://www.felixcloutier.com/x86/unpcklps) to interleave the low elements from a register with itself:
the lowest element is duplicated into the bottom two lanes, the second-lowest into the top two.
You could instead use shufps
, but that wouldn't be any better in this case, and it needs an immediate byte. To copy-and-shuffle, you could use pshufd
, but on a few CPUs that integer instruction is slower between FP instructions (though it's still typically better than a movaps
+ unpcklps
: there's either no bypass latency, or it's 1 cycle and movaps would cost the same latency plus some throughput resources. The exception is Nehalem, where bypass latency would be 2 cycles. I don't think any CPUs with mov-elimination have bypass latency for shuffles, but maybe some AMD do.)
If you were having trouble finding the right shuffle instruction, consider writing it in C and seeing if clang can turn it into a shuffle for you. Like _mm_set_ps(v[1], v[1], v[0], v[0])
. In general that won't always compile to good asm, but it's worth a try with clang -O3
(clang has a very good shuffle optimizer). In this case both GCC and clang figure out how to do it with one unpcklps xmm0,xmm0
(https://godbolt.org/z/o6PTeP) instead of the disaster that was possible. Or the reverse with shufps xmm0,xmm0, 5
(5 is 0b00'00'01'01
).
(Note that indexing a __m128
as v[idx]
is a GNU extension, but I'm only suggesting using it with clang to find a good shuffle. If you ultimately want intrinsics, check clang's asm and then use the intrinsic for that shuffle in your code, not a _mm_set
.)
Also see the SIMD chapter in Agner Fog's optimization guide (https://agner.org/optimize/); he has a good table of instructions to consider for different kinds of data movement. Also https://www.officedaytime.com/simd512e/simd.html has a good visual quick-reference, and https://software.intel.com/sites/landingpage/IntrinsicsGuide/ lets you filter by category (Swizzle = shuffles), and by ISA level (so you can exclude AVX512 which has a bazillion versions of every intrinsic with masking.)
See also https://stackoverflow.com/tags/sse/info for these links and more.
If you don't know the available instructions well (and the CPU-architecture / performance-tuning details), you're probably better off using C with intrinsics. Even if you express a shuffle in a less efficient way, the compiler can often find a better one. e.g. compilers would hopefully optimize _mm_shuffle_ps(v,v, _MM_SHUFFLE(1,1,0,0))
into unpcklps
for you.
It's very rare that hand-written asm is the right choice, especially for x86. Compilers generally do a good job with intrinsics, especially GCC and clang. If you didn't know that unpcklps
existed, you're probably a long way from being able to beat the compiler easily / routinely.