
Any preference to SHUFPD or PSHUFD for reversing two packed doubles in an XMM?


Today's question is fairly short. Consider the following toy C program shuffle.c for reversing two packed doubles in register xmm0:

#include <stdio.h>

int main () {
  double x[2] = {0.0, 1.0};
  asm volatile (
    "movupd  (%[x]), %%xmm0\n\t"
    "shufpd  $1, %%xmm0, %%xmm0\n\t"  /* method 1 */
    //"pshufd  $78, %%xmm0, %%xmm0\n\t"  /* method 2 */
    "movupd  %%xmm0, (%[x])\n\t"
    :
    : [x] "r" (x)
    : "xmm0", "memory");
  printf("x[0] = %.2f, x[1] = %.2f\n", x[0], x[1]);
  }

After a quick test: gcc -msse3 -o shuffle shuffle.c && ./shuffle, both methods/instructions return the correct result x[0] = 1.00, x[1] = 0.00. This page says that shufpd has a latency of 6 cycles, while the Intel intrinsics guide says that pshufd has a latency of only 1 cycle. That sounds like a strong argument for pshufd. However, that instruction is really intended for packed integers. When using it on packed doubles, will there be any penalty associated with the "wrong type"?

As a similar question, I have also heard that the instruction movaps is 1 byte shorter than movapd, and that they do the same thing: read 128 bits from a 16-byte aligned address. So can we always use the former for moves (between XMMs) / loads (from memory) / stores (to memory)? That seems crazy, so I think there must be some reason not to. Can someone give me an explanation? Thank you.


Solution

  • You'll always get correct results, but it can matter for performance.

    Prefer FP shuffles for FP data that will be an input to FP math instructions (like addps or vfma..., as opposed to insns like xorps).

    This avoids any extra bypass-delay latency on some microarchitectures, including potentially current Intel chips. See Agner Fog's microarchitecture guide. AMD Bulldozer-family does all shuffles in the vector-integer domain, so there's a bypass delay whichever shuffle you use.

    If it saves instructions, it can be worth it to use an integer shuffle anyway. (But usually it's the other way around, where you want to use shufps to combine data from two integer vectors. That's fine in even more cases, and mostly a problem only on Nehalem, IIRC.)


    http://x86.renejeschke.de/html/file_module_x86_id_293.html lists the latency for CPUID 0F3n/0F2n CPUs, i.e. Pentium4 (family 0xF model 2 (Northwood) / model 3 (Prescott)). Those numbers are obviously totally irrelevant, and don't even match Agner Fog's P4 table for shufpd.

    Intel's intrinsics guide sometimes has numbers that don't match experimental testing, either. See Agner Fog's instruction tables for good latency/throughput numbers, and microarch guides to understand the details.


    movaps vs. movapd: No existing microarchitectures care which you use. It would be possible for someone in the future to design an x86 CPU that kept double vectors separate from float vectors internally, but for now the only distinction has been int vs. FP.

    Always prefer the ps instruction when the behaviour is identical (xorps over xorpd, movhps over movhpd).


    Some compilers (maybe both gcc and clang, I forget) will compile a _mm_store_si128 integer vector store to movaps, because there's no performance downside on any existing hardware, and it's one byte shorter.

    IIRC, there's also no perf downside to loading integer vector data with movaps / movups, but I'm less sure about that.

    There is a perf downside to using the wrong mov instruction for a reg-reg move, though. movdqa xmm1, xmm2 between two FP instructions is bad on Nehalem.


    re: your inline asm:

    It doesn't need to be volatile, and you could drop the "memory" clobber if you used a 16 byte struct or something as a "+m" input/output operand. Or a "+x" vector-register operand for an __m128d variable.

    You'll probably get better results from intrinsics than from inline asm, unless you write whole loops in inline asm or stand-alone functions.

    See the tag wiki for a link to my inline asm guide.