I'm on a system where xmm0 is 128 bits wide. I want to set bits [63:0] to zero without affecting bits [127:64]. I use:
MOV RAX, 0xFFFFFFFFFFFFFFFF
MOVQ xmm2, RAX
PSHUFD xmm2, xmm2, 0b00001111
PAND xmm1, xmm2
Is there a faster way?
You can create the constant somewhat more efficiently via
pcmpeqd xmm2,xmm2 ; xmm2 = all-ones. Needs any ALU port
pslldq xmm2, 8 ; left shift by 8 bytes. Needs the shuffle port
PAND xmm1, xmm2
(See also Agner Fog's optimization guide; he has a section on creating constants on the fly. Also the related question "What are the best instruction sequences to generate vector constants on the fly?".)
Or as @RossRidge suggested, using a memory source operand for the constant may be most efficient if you need it often enough to stay hot in cache, but can't just hoist it out of a loop and keep it in a register.
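If you go the memory-operand route, here is a minimal NASM-style sketch (the label name and section layout are my own choices, not from the question or answer): keep the mask in static read-only data so pand can use it directly as a source operand.

section .rodata
align 16
clear_low_qword: dq 0, 0xFFFFFFFFFFFFFFFF ; low qword = 0, high qword = all-ones

section .text
pand xmm1, [rel clear_low_qword] ; zeroes xmm1[63:0], leaves xmm1[127:64] unchanged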
Or blend in a new low 8 bytes of zeros.
pxor xmm2, xmm2 ; xmm2=0; very efficient on Intel CPUs; no back-end uop
movsd xmm1, xmm2 ; runs on port5 only on Intel CPUs, like shuffles.
(As a load from memory, movsd zero-extends. But for reg-reg moves, it and movss leave the destination upper part unmodified.)
Alternate ways to blend are more efficient but require more than SSE2:
- pblendw xmm1, xmm2, 0b00001111 - Worse on everything (or equal speed but worse code-size). Still only runs on port5 on Intel. Ryzen runs movsd xmm,xmm on even more ports than pblendw. Low-power Atom/Silvermont runs movsd on more ports than pblendw, but Goldmont and KNL have 2/clock throughput for this and movsd. So it's still never better than movsd.
- blendpd xmm1, xmm2, 0b01 (or blendps) - as efficient as vpblendd, but will incur bypass-forwarding latency if used between integer instructions. If you're bottlenecked on throughput this might be ok, especially if you have to avoid back-end pressure.
- vpblendd xmm1, xmm1, xmm2, 0b0011 - runs on any ALU port on any AVX2 CPU (see the sketch after this list).

Some CPUs might also have a bypass delay for movsd between integer instructions, but Sandybridge-family is pretty forgiving for shuffles.
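As a sketch of the AVX2 vpblendd variant from the list above (the choice of xmm2 as the zeroed register is arbitrary):

vpxor xmm2, xmm2, xmm2 ; zeroed register; cheap to recreate, no constant load needed
vpblendd xmm1, xmm1, xmm2, 0b0011 ; dwords 0-1 (the low 64 bits) come from xmm2, dwords 2-3 from xmm1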
Equally efficient as movsd on some CPUs, requiring only SSE1:

- movhlps xmm1, xmm2 - replace the low qword of xmm1 with the high qword of xmm2 (also zero). Less efficient on Ryzen or Silvermont.

Similarly, shufpd and shufps could copy the top half of xmm1 into the top half of a zeroed register, which is useful if you don't want to destroy the original reg. But you can do that with movsd just as easily and more efficiently.
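For the non-destructive shufpd version just mentioned, a sketch (assuming xmm2 is a free scratch register):

xorps xmm2, xmm2 ; scratch register of zeros
shufpd xmm2, xmm1, 0b10 ; xmm2[63:0] stays zero, xmm2[127:64] = xmm1[127:64]; xmm1 is left intact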
Also possible: a movlps xmm, [mem] load of zeros, possibly zeros that you just stored to the stack. It doesn't allow a register source operand, and needs a port5 uop on Intel (shuffle / uncommon blend). It can micro-fuse into one fused-domain uop, but it's mostly worse than pand with a memory source because it can run on fewer ports.
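A literal-minded sketch of that movlps idea, storing zeros to the stack first (in real code you'd reuse zeros that already happen to be in memory):

sub rsp, 8
mov qword [rsp], 0 ; 8 bytes of zeros on the stack
movlps xmm1, [rsp] ; replaces xmm1[63:0] with those zeros; xmm1[127:64] unchanged
add rsp, 8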
The insertps instruction: SSE4.1 insertps can do this in one instruction. (Insert an element from itself to itself, then apply zeroing.) It's an FP shuffle, so some CPUs might have extra bypass-forwarding latency between it and surrounding integer instructions, but probably not Intel Sandybridge-family CPUs. (Nehalem would have this penalty, but it's old enough not to worry about.)
insertps xmm1, xmm1, 0b00_00_0011 ; fields are: src elem, dst elem, zmask
; NASM syntax allows _ between digits in a number, like C++ allows '
If you need to do it repeatedly, it can be worth creating a vector constant so you can use a cheaper instruction each time, like pand or vpblendd.
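For example, hoisting the mask out of a loop (the loop structure, registers, and labels here are purely illustrative):

pcmpeqd xmm7, xmm7 ; all-ones, created once before the loop
pslldq xmm7, 8 ; xmm7 = high qword all-ones, low qword zero
.loop:
movdqu xmm1, [rdi]
pand xmm1, xmm7 ; the cheap per-iteration instruction; runs on any ALU port
movdqu [rdi], xmm1
add rdi, 16
cmp rdi, rsi
jb .loop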
Clang optimizes v = _mm_insert_ps(v, v, 0b00'00'0011); into vxorps / vblendps xmm, xmm, 3, the same as it does for using GNU C native vector syntax to write v[0] = 0; on a __m128d, so that confirms I got the constant correct (Godbolt).
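For reference, the clang output described above has roughly this shape (register allocation will of course differ in real compiler output):

vxorps xmm2, xmm2, xmm2 ; zero vector
vblendps xmm1, xmm1, xmm2, 3 ; take elements 0-1 (the low 64 bits) from the zero vector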
GCC unfortunately uses vmovsd or even vpinsrq from an integer register for v[0] = 0;, even when optimizing for -march=skylake, so it should know that those aren't the cheapest instructions for that CPU.