Tags: assembly, x86, sse, simd, micro-optimization

X86: How to set lower half of xmm0 to 0, without affecting the upper half?


I use a system where xmm0 has 128 bits. I want to set bits [63...0] to zero without affecting bits [127...64]. I use:

MOV RAX, 0xFFFFFFFFFFFFFFFF
MOVQ xmm2, RAX
PSHUFD xmm2, xmm2, 0b00001111
PAND xmm1, xmm2

Is there a faster way?


Solution

  • You can create the constant somewhat more efficiently via

    pcmpeqd xmm2,xmm2       ; xmm2 = all-ones.  Needs any ALU port
    pslldq  xmm2, 8         ; left shift by 8 bytes.  Needs the shuffle port
    
    PAND    xmm1, xmm2
    

    (See also Agner Fog's optimization guide, which has a section on creating constants on the fly, and the related Q&A "What are the best instruction sequences to generate vector constants on the fly?")
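For readers working in C rather than asm, the same mask-and-AND idea can be sketched with SSE2 intrinsics. This is only an illustration: the function name `zero_low_qword_pand` is made up here, and a compiler is free to pick different instructions (for example, loading the mask from memory) than the three the intrinsics nominally map to.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Clear bits [63:0] of v and keep bits [127:64].
   Hypothetical helper name, not from the original answer. */
static inline __m128i zero_low_qword_pand(__m128i v)
{
    /* Comparing a register with itself yields all-ones (pcmpeqd). */
    __m128i ones = _mm_cmpeq_epi32(v, v);
    /* Byte-shift left by 8 (pslldq): high qword = all-ones, low qword = 0. */
    __m128i mask = _mm_slli_si128(ones, 8);
    /* pand keeps only the high qword of v. */
    return _mm_and_si128(v, mask);
}
```

If the mask is needed in a loop, it would normally be computed once outside the loop and kept in a register.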

    Or as @RossRidge suggested, using a memory source operand for the constant may be most efficient if you need it often enough to stay hot in cache, but can't just hoist it out of a loop and keep it in a register.
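A rough C sketch of the memory-constant variant follows. The names `high_qword_mask` and `zero_low_qword_memmask` are hypothetical, and whether the load is actually folded into `pand`'s memory operand is up to the compiler. The 16-byte alignment matters because legacy-SSE memory operands must be aligned.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical constant: element 0 is the low qword on little-endian x86. */
static _Alignas(16) const uint64_t high_qword_mask[2] = { 0, UINT64_MAX };

/* Clear bits [63:0] of v using a mask that lives in memory. */
static inline __m128i zero_low_qword_memmask(__m128i v)
{
    /* Aligned load; a compiler can fold this into pand xmm, [mem]. */
    __m128i mask = _mm_load_si128((const __m128i *)high_qword_mask);
    return _mm_and_si128(v, mask);
}
```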


    Or blend in a new low 8 bytes of zeros.

    pxor   xmm2, xmm2       ; xmm2=0; very efficient on Intel CPUs; no back-end uop
    
    movsd  xmm1, xmm2       ; runs on port5 only on Intel CPUs, like shuffles.
    

    (As a load from memory, movsd zero-extends. But as reg-reg moves, it and movss leave the upper part of the destination unmodified.)
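The pxor + movsd blend can be written in C roughly as below. The helper name is an assumption for illustration; the casts are free (they only reinterpret the register) and `_mm_move_sd(a, b)` takes the low qword from `b` and the high qword from `a`.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Replace the low qword of v with zero via a register-to-register movsd.
   Hypothetical helper name. */
static inline __m128i zero_low_qword_movsd(__m128i v)
{
    __m128d zero = _mm_setzero_pd();                      /* xor-zeroing */
    /* Result: low 64 bits from zero, high 64 bits from v. */
    __m128d r = _mm_move_sd(_mm_castsi128_pd(v), zero);
    return _mm_castpd_si128(r);
}
```

No floating-point arithmetic happens here, so the integer bit pattern in the high half passes through unchanged.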

    Other ways to blend require more than SSE2, and only some of them are more efficient:

    • SSE4.1: pblendw xmm1, xmm2, 0b00001111 - Worse than movsd everywhere (or equal speed but worse code-size). Still only runs on port5 on Intel. Ryzen runs movsd xmm,xmm on even more ports than pblendw. Low-power Atom/Silvermont runs movsd on more ports than pblendw, but Goldmont and KNL have 2/clock throughput for both pblendw and movsd. So it's still never better than movsd.
    • SSE4.1: blendpd xmm1, xmm2, 0b01 (or blendps) - as efficient as vpblendd, but will incur bypass-forwarding latency if used between integer instructions. If you're bottlenecked on throughput this might be ok, especially if you have to avoid back-end pressure.
    • AVX2: vpblendd xmm1, xmm1, xmm2, 0b0011 - runs on any ALU port on any AVX2 CPU.
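As a hedged sketch, the three blends listed above correspond to the intrinsics below. The function names are invented for this example, and the `target` attribute is a GCC/Clang extension used here so the file builds without `-msse4.1` or `-mavx2`; passing those flags instead works as well. The AVX2 variant must only be called on a CPU that supports AVX2.

```c
#include <immintrin.h>  /* SSE4.1 and AVX2 intrinsics */

/* pblendw: immediate bits 0-3 select the low four words from the zero vector. */
__attribute__((target("sse4.1")))
static inline __m128i zero_low_qword_pblendw(__m128i v)
{
    return _mm_blend_epi16(v, _mm_setzero_si128(), 0x0F);
}

/* blendpd: immediate bit 0 selects the low double from the zero vector. */
__attribute__((target("sse4.1")))
static inline __m128i zero_low_qword_blendpd(__m128i v)
{
    __m128d r = _mm_blend_pd(_mm_castsi128_pd(v), _mm_setzero_pd(), 0x1);
    return _mm_castpd_si128(r);
}

/* vpblendd: immediate bits 0-1 select the low two dwords from the zero vector. */
__attribute__((target("avx2")))
static inline __m128i zero_low_qword_vpblendd(__m128i v)
{
    return _mm_blend_epi32(v, _mm_setzero_si128(), 0x3);
}
```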

    Some CPUs might also have a bypass delay for movsd between integer instructions, but Sandybridge-family is pretty forgiving for shuffles.

    As efficient as movsd on some CPUs, and requiring only SSE1:

    • movhlps xmm1, xmm2 - replace the low qword of xmm1 with the high qword of xmm2 (also zero). Less efficient on Ryzen or Silvermont.

    Similarly, shufpd and shufps could copy the top half of xmm1 into the top half of a zeroed register. (Useful if you don't want to destroy the original reg). But you can do that with movsd just as easily and more efficiently.
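The movhlps and shufpd variants might look like this with intrinsics. The helper names are made up, and this is a sketch rather than a recommendation, since the text above already notes that movsd is at least as good. `_mm_movehl_ps(a, b)` puts the high qword of `b` in the low half and keeps the high qword of `a`; `_mm_shuffle_pd(a, b, 2)` takes the low qword of `a` and the high qword of `b`.

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE1 header */

/* movhlps: low qword comes from the high qword of a zeroed register. */
static inline __m128i zero_low_qword_movhlps(__m128i v)
{
    __m128 r = _mm_movehl_ps(_mm_castsi128_ps(v), _mm_setzero_ps());
    return _mm_castps_si128(r);
}

/* shufpd: build the result in a fresh register, leaving v untouched.
   Low qword = low half of zero, high qword = high half of v. */
static inline __m128i zero_low_qword_shufpd(__m128i v)
{
    __m128d r = _mm_shuffle_pd(_mm_setzero_pd(), _mm_castsi128_pd(v), 0x2);
    return _mm_castpd_si128(r);
}
```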


    Also possible: a movlps xmm, [mem] load of zeros, possibly zeros that you just stored to the stack. It doesn't allow a register source operand, and it needs a port5 uop on Intel (shuffle / uncommon blend). It can micro-fuse into one fused-domain uop, but it's mostly worse than pand with a memory source because it can run on fewer ports.
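For completeness, a sketch of the movlps-from-memory idea using `_mm_loadl_pi`, which loads 64 bits into the low half and keeps the high half of its first argument. The names `zero_qword` and `zero_low_qword_movlps` are hypothetical, and the pointer cast to `__m64` assumes GCC or Clang on x86-64.

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE1 header */
#include <stdint.h>

/* Hypothetical 8 bytes of zeros in memory to load from. */
static const uint64_t zero_qword = 0;

/* movlps xmm, [mem]: overwrite the low qword of v with zeros from memory. */
static inline __m128i zero_low_qword_movlps(__m128i v)
{
    __m128 r = _mm_loadl_pi(_mm_castsi128_ps(v), (const __m64 *)&zero_qword);
    return _mm_castps_si128(r);
}
```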


    With a single insertps instruction:

    SSE4.1 insertps can do this in one instruction. (Insert an element from itself to itself, then apply zeroing.) It's an FP shuffle, so some CPUs might have extra bypass-forwarding latency between it and surrounding integer instructions, but probably not Intel Sandybridge-family CPUs. (Nehalem would have this penalty, but it is old enough that you probably don't need to care about it.)

     insertps  xmm1, xmm1, 0b00_00_0011    ; fields are: src elem, dst elem, zmask
      ; NASM syntax allows _ between digits in a number, like C++ allows '
    

    If you need to do it repeatedly, it can be worth creating a vector constant so you can use a cheaper instruction, like pand or vpblendd.

    Clang optimizes v = _mm_insert_ps(v,v, 0b00'00'0011); into vxorps / vblendps xmm,xmm, 3, same as for using GNU C native vector syntax to do v[0] = 0; on a __m128d, so that confirms I got the constant correct. Godbolt

    GCC unfortunately uses vmovsd or even vpinsrq from an integer register for v[0] = 0;, even with -march=skylake, where it should know that those aren't the cheapest instructions for that CPU.
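A self-contained version of the `_mm_insert_ps` call mentioned above could look like the sketch below. The wrapper name is an assumption, and the immediate is written in hex (0x03) because binary literals with digit separators are not available in every C dialect. The `target` attribute is a GCC/Clang extension; compiling with `-msse4.1` is an alternative. As noted above, a compiler may turn this into a different sequence such as xorps + blendps.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* insertps with src elem 0, dst elem 0, zmask 0b0011:
   the insert itself is a no-op, then elements 0 and 1 are zeroed. */
__attribute__((target("sse4.1")))
static inline __m128i zero_low_qword_insertps(__m128i v)
{
    __m128 f = _mm_castsi128_ps(v);
    __m128 r = _mm_insert_ps(f, f, 0x03);
    return _mm_castps_si128(r);
}
```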