Tags: x86, sse, intrinsics, micro-optimization, sse3

What is the difference between _mm_movehdup_ps and _mm_shuffle_ps in this case?


If my understanding is correct,

_mm_movehdup_ps(a)

gives the same result as

_mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 1, 3, 3))?

Is there a performance difference between the two?


Solution

  • _MM_SHUFFLE takes the high element first, so _MM_SHUFFLE(3,3, 1,1) would do the movshdup shuffle.
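
    To make the argument order concrete, here is a minimal sketch that needs only SSE1. The function names and the self-check are made up for illustration. It compares _MM_SHUFFLE(3,3, 1,1) with the order written in the question:

    ```c
    #include <assert.h>
    #include <xmmintrin.h>  /* SSE1: _mm_shuffle_ps, _MM_SHUFFLE */

    /* {a0,a1,a2,a3} -> {a1,a1,a3,a3}, the same result as movshdup.
       _MM_SHUFFLE lists selectors from result element 3 down to element 0. */
    __m128 dup_odd_shufps(__m128 a) {
        return _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 3, 1, 1));
    }

    /* The order written in the question: {a0,a1,a2,a3} -> {a3,a3,a1,a1}. */
    __m128 question_order(__m128 a) {
        return _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 1, 3, 3));
    }

    /* Hypothetical self-check on the vector {10, 11, 12, 13}. */
    void check_shuffle_order(void) {
        float hi[4], q[4];
        __m128 a = _mm_setr_ps(10.f, 11.f, 12.f, 13.f);
        _mm_storeu_ps(hi, dup_odd_shufps(a));
        _mm_storeu_ps(q, question_order(a));
        assert(hi[0] == 11.f && hi[1] == 11.f && hi[2] == 13.f && hi[3] == 13.f);
        assert(q[0] == 13.f && q[1] == 13.f && q[2] == 11.f && q[3] == 11.f);
        assert(_MM_SHUFFLE(3, 3, 1, 1) == 245);  /* 0xF5 */
    }
    ```

    The value 245 (0xF5) is the same immediate that appears in the vshufps in the compiler output further down.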

    The main difference is at the assembly level: movshdup is a copy-and-shuffle, so it avoids a movaps to copy the input when the input a is still needed later (e.g. as part of a horizontal sum; see Fastest way to do horizontal float vector sum on x86 for an example of how it compiles without a movaps, vs. the SSE1 version that uses shufps).
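
    As a rough sketch of that horizontal-sum case (function names are hypothetical, and the preprocessor guard is my assumption about how you might select the SSE3 path), note that v is read twice: once by the shuffle and once by the add. That is the situation where a copy-and-shuffle can save a register copy in non-AVX code:

    ```c
    #include <assert.h>
    #include <immintrin.h>

    /* Horizontal sum of the 4 floats in v. */
    float hsum_ps(__m128 v) {
    #if defined(__SSE3__) || defined(__AVX__)
        __m128 shuf = _mm_movehdup_ps(v);   /* {v1,v1,v3,v3}; v is left intact */
    #else
        /* SSE1 fallback: shufps overwrites its destination, so without AVX the
           compiler has to copy v first (typically with movaps). */
        __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 1, 1));
    #endif
        __m128 sums = _mm_add_ps(v, shuf);  /* {v0+v1, _, v2+v3, _} */
        shuf = _mm_movehl_ps(shuf, sums);   /* bring v2+v3 down to element 0 */
        sums = _mm_add_ss(sums, shuf);      /* (v0+v1) + (v2+v3) in element 0 */
        return _mm_cvtss_f32(sums);
    }

    /* Hypothetical usage check: 1 + 2 + 3 + 4 is exact in float. */
    void check_hsum(void) {
        assert(hsum_ps(_mm_setr_ps(1.f, 2.f, 3.f, 4.f)) == 10.f);
    }
    ```

    Whether the movaps actually disappears depends on the compiler and flags, so it's worth checking the asm output rather than assuming.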

    movshdup/movsldup can also be a load+shuffle with a memory source operand. (shufps obviously can't, because it needs the same input twice.) On modern Intel CPUs (Sandybridge-family), movshdup xmm0, [rdi] decodes to a pure load uop, not micro-fused with an ALU uop. So it doesn't compete for ALU shuffle throughput (port 5) against other shuffles. The load ports contain logic to do broadcast loads (including movddup 64-bit broadcast), and movs[lh]dup duplication of pairs of elements. More complicated load+shuffle like vpermilps xmm0, [rdi], 0x12 or pshufd xmm, [rdi], 0x12 do still decode to multiple uops, possibly micro-fused into a load+ALU depending on the uarch.


    Both instructions are the same length: movshdup avoids the immediate byte, but shufps is an SSE1 instruction so it only has a 2-byte opcode, 1 byte shorter than SSE2 and SSE3 instructions. But with AVX enabled, vmovshdup does save a byte, because the opcode-size advantage goes away.
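
    To spell out the byte counting, here is a small sketch with the register-register encodings written out by hand (xmm0 for every operand, 0xF5 as the shuffle immediate). The array names are hypothetical, and the bytes follow the usual encoding rules but are worth confirming with an assembler. movshdup spends a byte on its mandatory F3 prefix, while shufps spends one on the immediate. With VEX the prefix is folded into the 2-byte VEX prefix for both, so only the immediate is left over:

    ```c
    #include <assert.h>  /* static_assert (C11) */

    /* Legacy SSE encodings: 4 bytes each. */
    static const unsigned char movshdup_bytes[] = {0xF3, 0x0F, 0x16, 0xC0}; /* movshdup xmm0, xmm0 */
    static const unsigned char shufps_bytes[]   = {0x0F, 0xC6, 0xC0, 0xF5}; /* shufps xmm0, xmm0, 0xF5 */

    /* VEX encodings: the shuffle with an immediate is one byte longer. */
    static const unsigned char vmovshdup_bytes[] = {0xC5, 0xFA, 0x16, 0xC0};       /* vmovshdup xmm0, xmm0 */
    static const unsigned char vshufps_bytes[]   = {0xC5, 0xF8, 0xC6, 0xC0, 0xF5}; /* vshufps xmm0, xmm0, xmm0, 0xF5 */

    static_assert(sizeof movshdup_bytes == sizeof shufps_bytes,
                  "legacy SSE: same length");
    static_assert(sizeof vshufps_bytes == sizeof vmovshdup_bytes + 1,
                  "VEX: the immediate costs one byte");
    ```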


    On older CPUs with only 64-bit shuffle units (like Pentium-M and first-gen Core 2 (Merom)), there was a larger performance advantage. movshdup only shuffles within 64-bit halves of the vector. On Core 2 Merom, movshdup xmm, xmm decodes to 1 uop, but shufps xmm, xmm, i decodes to 3 uops. (See https://agner.org/optimize/ for instruction tables and microarch guide). See also my horizontal sum answer (linked earlier) for more about SlowShuffle CPUs like Merom and K8.


    In C++ with intrinsics

    If SSE3 is enabled, it's a missed optimization if your compiler doesn't optimize _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 3, 1, 1)) into the same assembly it would make for _mm_movehdup_ps(a).

    Some compilers (like MSVC) don't typically optimize intrinsics, though, so it's up to the programmer to understand the asm implications of avoiding movaps instructions by using intrinsics for copy-and-shuffle instructions (like pshufd and movshdup) instead of shuffles that necessarily destroy their destination register (like shufps, and like psrldq byte-shifts.)
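
    If SSE3 can't be assumed, pshufd is the SSE2 copy-and-shuffle option. A minimal sketch (hypothetical function name) of using it on float data via the cast intrinsics, which only reinterpret the type:

    ```c
    #include <assert.h>
    #include <emmintrin.h>  /* SSE2: _mm_shuffle_epi32 (pshufd) and the casts */

    /* {a0,a1,a2,a3} -> {a1,a1,a3,a3} via pshufd, which writes a separate
       destination register and leaves its source intact. */
    __m128 dup_odd_pshufd(__m128 a) {
        __m128i bits = _mm_castps_si128(a);  /* reinterpret only */
        __m128i shuf = _mm_shuffle_epi32(bits, _MM_SHUFFLE(3, 3, 1, 1));
        return _mm_castsi128_ps(shuf);
    }

    /* Hypothetical self-check on the vector {10, 11, 12, 13}. */
    void check_dup_odd_pshufd(void) {
        float out[4];
        _mm_storeu_ps(out, dup_odd_pshufd(_mm_setr_ps(10.f, 11.f, 12.f, 13.f)));
        assert(out[0] == 11.f && out[1] == 11.f && out[2] == 13.f && out[3] == 13.f);
    }
    ```

    An optimizing compiler is free to pick a different instruction for this, and on some microarchitectures an integer shuffle between FP instructions may cost extra bypass latency, so treat it as something to measure rather than a guaranteed win.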

    Also, MSVC doesn't let you enable compiler use of SSE3; you only get instructions beyond the baseline SSE2 (or no SIMD) if you use intrinsics for them. If you enable AVX, that allows the compiler to use SSE4.2 and earlier as well, but it still chooses not to optimize. So again, it's up to the human programmer to find optimizations. ICC is similar. Sometimes this can be a good thing if you know exactly what you're doing and are checking the compiler's asm output, because sometimes gcc or clang's optimizations can pessimize your code.
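
    One way to handle this in portable source is to pick the intrinsic yourself with the preprocessor. This is only a sketch: HAVE_MOVSHDUP and dup_odd are hypothetical names, and the assumption is that GCC/clang/ICC define __SSE3__ when SSE3 is enabled, while MSVC never defines it and only defines __AVX__ under -arch:AVX.

    ```c
    #include <assert.h>
    #include <immintrin.h>

    /* Hypothetical feature macro: is movshdup known to be available? */
    #if defined(__SSE3__) || defined(__AVX__)
    #  define HAVE_MOVSHDUP 1
    #else
    #  define HAVE_MOVSHDUP 0
    #endif

    /* {a0,a1,a2,a3} -> {a1,a1,a3,a3} with the best shuffle we know we have. */
    __m128 dup_odd(__m128 a) {
    #if HAVE_MOVSHDUP
        return _mm_movehdup_ps(a);  /* copy-and-shuffle, no immediate */
    #else
        return _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 3, 1, 1));  /* SSE1 */
    #endif
    }

    /* Hypothetical self-check: both paths must give the same result. */
    void check_dup_odd(void) {
        float out[4];
        _mm_storeu_ps(out, dup_odd(_mm_setr_ps(10.f, 11.f, 12.f, 13.f)));
        assert(out[0] == 11.f && out[1] == 11.f && out[2] == 13.f && out[3] == 13.f);
    }
    ```

    On MSVC without -arch:AVX this falls back to shufps even if your target CPUs all have SSE3, so there you might prefer to call _mm_movehdup_ps unconditionally.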

    Probably a good idea to compile with clang and see if it uses the same instructions as the intrinsics in your source; it has by far the best shuffle optimizer out of any of the 4 major compilers that support Intel intrinsics, basically optimizing your intrinsics code the same way compilers normally optimize pure C, i.e. just following the as-if rule to produce the same result.

    The most trivial example:

    #include <immintrin.h>
    
    __m128 shuf1(__m128 a) {
        return _mm_shuffle_ps(a,a, _MM_SHUFFLE(3,3, 1,1));
    }
    

    compiled with gcc/clang/MSVC/ICC on Godbolt

    GCC and clang with -O3 -march=core2 both spot the optimization:

    shuf1:
            movshdup        xmm0, xmm0
            ret
    

    ICC -O3 -march=haswell and MSVC -O2 -arch:AVX -Gv (-Gv enables the vectorcall calling convention, instead of passing SIMD vectors by reference) do not, and emit the shuffle as written:

    shuf1:
            vshufps   xmm0, xmm0, xmm0, 245                         #4.12
            ret                                                     #4.12