I have two UInt64 (i.e. 64-bit quadword) integers. They are contiguous in memory, aligned to a `sizeof(UInt64)` boundary (I could also align them to 16 bytes if that's useful for anything).

How do I load them into an xmm register, e.g. `xmm0`?
I've found:

```
movq xmm0, v[0]
```

but that only moves `v[0]`, and sets the upper 64 bits in `xmm0` to zeros:

```
xmm0: 0000000000000000 24FC18D93B2C9D8F
```
As W. Chang pointed out, the endianness is little, and I'm OK with it being the other way around.
My conundrum is how to get them in, and get them out.
For an unaligned 128-bit load, use one of:

- `movups xmm0, [v0]`: move unaligned packed single-precision floating point, for `float` or `double` data. (`movupd` is 1 byte longer but never makes a performance difference.)
- `movdqu xmm0, [v0]`: move unaligned double quadword

Even if the two quadwords are split across a cache-line boundary, that's normally the best choice for throughput. (On AMD CPUs, there can be a penalty when the load doesn't fit within an aligned 32-byte block of a cache line, not just when it crosses a 64-byte cache-line boundary. But on Intel, any misalignment within a 64-byte cache line is free.)
If your loads are feeding integer-SIMD instructions, you probably want `movdqu`, even though `movups` is 1 byte shorter in machine code. Some CPUs may care about "domain crossing" for different types of loads. For stores it doesn't matter: many compilers always use `movups` even for integer data.
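For concreteness, here's a minimal NASM-syntax sketch of the contiguous case; the label `v`, the example data values, and the function wrapper are mine, not from the question:

```nasm
default rel                    ; use RIP-relative addressing

section .data
align 8                        ; 8-byte aligned, as in the question; 16 would allow movdqa/movaps
v:  dq 0x24FC18D93B2C9D8F      ; low quadword  -> bits [63:0]   of xmm0
    dq 0x0123456789ABCDEF      ; high quadword -> bits [127:64] of xmm0 (arbitrary example value)

section .text
global load_two_qwords
load_two_qwords:
    movdqu  xmm0, [v]          ; one unaligned 16-byte load; movups [v] would also work
    ret
```

With `align 16` on `v`, `movdqa`/`movaps` would also be usable, but on modern CPUs the unaligned forms cost nothing extra when the data happens to be aligned.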
See also How can I accurately benchmark unaligned access speed on x86_64 for more about the costs of unaligned loads (SIMD and otherwise).
If they weren't contiguous, your best bet is:

- `movq xmm0, [v0]`: move quadword
- `movhps xmm0, [v1]`: move high packed single-precision floating point. (There's no integer equivalent; use this anyway. Never use `movhpd`, it's longer for no benefit because no CPUs care about double vs. float shuffles.)

Or on an old x86, like Core 2 and other old CPUs where `movups` was slow even when the 16 bytes all came from within the same cache line, you might use:

- `movq xmm0, [v0]`: move quadword
- `movhps xmm0, [v0+8]`: move high packed single-precision floating point

`movhps` is slightly more efficient than SSE4.1 `pinsrq xmm0, [v1], 1` (2 uops that can't micro-fuse on Intel Sandybridge-family: 1 uop for the load ports, 1 for port 5). `movhps` is 1 micro-fused uop, but it still needs the same back-end ports: load + shuffle.
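A sketch of that non-contiguous variant, again in NASM syntax (the `v0`/`v1` labels and data are placeholders for wherever the two quadwords actually live):

```nasm
default rel

section .data
align 8
v0: dq 0x24FC18D93B2C9D8F      ; first quadword
    dq 0                       ; filler so v0 and v1 are deliberately not adjacent
v1: dq 0x0123456789ABCDEF      ; second quadword (arbitrary example value)

section .text
global load_two_separate_qwords
load_two_separate_qwords:
    movq    xmm0, [v0]         ; low 64 bits = v0, upper 64 bits zeroed
    movhps  xmm0, [v1]         ; high 64 bits = v1, low 64 bits left unchanged
    ret
```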
See Agner Fog's x86 optimization guide; he has a chapter about SIMD with a big section on data movement: https://agner.org/optimize/. And see other links in https://stackoverflow.com/tags/x86/info.
To get the data back out, `movups` can work as a store, and so can `movlps`/`movhps` to scatter the qword halves. (But don't use `movlps` as a load: it merges into the old value of the register, creating a false dependency, unlike `movq` or `movsd`, which zero-extend.)
`movlps` is 1 byte shorter than `movq`, but both can store the low 64 bits of an xmm register to memory. Compilers often ignore domain-crossing (vec-int vs. vec-fp) for stores, so you should too: generally use the SSE1 `...ps` instructions when they're exactly equivalent, for stores. (Not for reg-reg moves: Nehalem can slow down on `movaps` between integer-SIMD instructions like `paddd`, or vice versa.)
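Getting the halves back out might look like this (a sketch; the `out`, `out0`, and `out1` destination buffers are made up for the example):

```nasm
default rel

section .bss
align 8
out:  resq 2                   ; room for both quadwords, contiguous
out0: resq 1                   ; separate destination for the low half
out1: resq 1                   ; separate destination for the high half

section .text
global store_two_qwords
store_two_qwords:
    movups  [out],  xmm0       ; one unaligned 16-byte store; fine even for integer data

    movlps  [out0], xmm0       ; store the low quadword (1 byte shorter than movq)
    movhps  [out1], xmm0       ; store the high quadword
    ret
```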
In all cases AFAIK, no CPUs care about `float` vs. `double` for anything other than actual add/multiply instructions; there aren't CPUs with separate `float` and `double` bypass-forwarding domains. The ISA design leaves that option open, but in practice there's never a penalty for saving a byte by using `movups` or `movaps` to copy around a vector of `double`, or for using `movlps` instead of `movlpd`. `double` shuffles are sometimes useful, though, because `unpcklpd` is like `punpcklqdq` (interleave 64-bit elements), while `unpcklps` is like `punpckldq` (interleave 32-bit elements).
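As a small illustration of that last point, here's a hedged sketch of building the same 128-bit value from two separate registers with `punpcklqdq` instead of loading it straight from memory (the function name and the System V argument registers are my assumptions):

```nasm
default rel

section .text
global combine_qwords
; Assumed System V x86-64 arguments: rdi = &low_qword, rsi = &high_qword
combine_qwords:
    movq       xmm0, [rdi]     ; xmm0 = [ 0    | low  ]
    movq       xmm1, [rsi]     ; xmm1 = [ 0    | high ]
    punpcklqdq xmm0, xmm1      ; xmm0 = [ high | low  ]: interleave the low 64-bit halves
    ; unpcklpd xmm0, xmm1 would do the same data movement in the FP domain
    ret
```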