I have two UInt64 (i.e. 64-bit quadword) integers. They are contiguous in memory, aligned to a `sizeof(UInt64)` boundary (I could also align them to 16 bytes if that's useful for anything).

How do I load them into an xmm register, e.g. `xmm0`?
I've found:

```
movq xmm0, v[0]
```

but that only moves `v[0]`, and sets the upper 64 bits in `xmm0` to zeros:

```
xmm0: 0000000000000000 24FC18D93B2C9D8F
```
As W. Chang pointed out, the endianness is little, and I'm OK with it being the other way around.
My conundrum is how to get them in, and get them out.
For an unaligned 128-bit load, use one of:

- `movups xmm0, [v0]`: move unaligned packed single-precision floating point, for `float` or `double` data. (`movupd` is 1 byte longer but never makes a performance difference.)
- `movdqu xmm0, [v0]`: move unaligned double quadword

Even if the two quadwords are split across a cache-line boundary, that's normally the best choice for throughput. (On AMD CPUs, there can be a penalty when the load doesn't fit within an aligned 32-byte block of a cache line, not just when it crosses a 64-byte cache-line boundary. But on Intel, any misalignment within a 64-byte cache line is free.)
If your loads are feeding integer-SIMD instructions, you probably want `movdqu`, even though `movups` is 1 byte shorter in machine code. Some CPUs may care about "domain crossing" for different types of loads. For stores it doesn't matter: many compilers always use `movups` even for integer data.
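For concreteness, here's a minimal NASM-syntax sketch of the contiguous case; the label `v`, the example data values, and the function wrapper are mine, not from the question:

```nasm
default rel                    ; use RIP-relative addressing

section .data
align 8                        ; 8-byte aligned, as in the question; 16 would allow movdqa/movaps
v:  dq 0x24FC18D93B2C9D8F      ; low quadword  -> bits [63:0]   of xmm0
    dq 0x0123456789ABCDEF      ; high quadword -> bits [127:64] of xmm0 (arbitrary example value)

section .text
global load_two_qwords
load_two_qwords:
    movdqu  xmm0, [v]          ; one unaligned 16-byte load; movups [v] would also work
    ret
```

With `align 16` on `v`, `movdqa`/`movaps` would also be usable, but on modern CPUs the unaligned forms cost nothing extra when the data happens to be aligned.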
See also How can I accurately benchmark unaligned access speed on x86_64 for more about the costs of unaligned loads (SIMD and otherwise).
If they weren't contiguous, your best bet is:

- `movq xmm0, [v0]`: move quadword
- `movhps xmm0, [v1]`: move high packed single-precision floating point. (There's no integer equivalent; use this anyway. Never use `movhpd`, it's longer for no benefit because no CPUs care about double vs. float shuffles.)

Or on an old x86, like Core 2 and other old CPUs where `movups` was slow even when the 16 bytes all came from within the same cache line, you might use:

- `movq xmm0, [v0]`: move quadword
- `movhps xmm0, [v0+8]`: move high packed single-precision floating point

`movhps` is slightly more efficient than SSE4.1 `pinsrq xmm0, [v1], 1` (2 uops that can't micro-fuse on Intel Sandybridge-family: 1 uop for the load ports, 1 for port 5). `movhps` is 1 micro-fused uop, but it still needs the same back-end ports: load + shuffle.
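A sketch of that non-contiguous variant, again in NASM syntax (the `v0`/`v1` labels and data are placeholders for wherever the two quadwords actually live):

```nasm
default rel

section .data
align 8
v0: dq 0x24FC18D93B2C9D8F      ; first quadword
    dq 0                       ; filler so v0 and v1 are deliberately not adjacent
v1: dq 0x0123456789ABCDEF      ; second quadword (arbitrary example value)

section .text
global load_two_separate_qwords
load_two_separate_qwords:
    movq    xmm0, [v0]         ; low 64 bits = v0, upper 64 bits zeroed
    movhps  xmm0, [v1]         ; high 64 bits = v1, low 64 bits left unchanged
    ret
```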
See Agner Fog's x86 optimization guide; he has a chapter about SIMD with a big section on data movement: https://agner.org/optimize/. And see other links in https://stackoverflow.com/tags/x86/info.
To get the data back out, `movups` can work as a store, and so can `movlps`/`movhps` to scatter the qword halves. (But don't use `movlps` as a load: it merges into the old value of the register, creating a false dependency, unlike `movq` or `movsd`, which zero-extend.)
`movlps` is 1 byte shorter than `movq`, but both can store the low 64 bits of an xmm register to memory. Compilers often ignore domain-crossing (vec-int vs. vec-fp) for stores, so you should too: generally use the SSE1 `...ps` instructions when they're exactly equivalent, for stores. (Not for reg-reg moves: Nehalem can slow down on `movaps` between integer-SIMD instructions like `paddd`, or vice versa.)
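Getting the halves back out might look like this (a sketch; the `out`, `out0`, and `out1` destination buffers are made up for the example):

```nasm
default rel

section .bss
align 8
out:  resq 2                   ; room for both quadwords, contiguous
out0: resq 1                   ; separate destination for the low half
out1: resq 1                   ; separate destination for the high half

section .text
global store_two_qwords
store_two_qwords:
    movups  [out],  xmm0       ; one unaligned 16-byte store; fine even for integer data

    movlps  [out0], xmm0       ; store the low quadword (1 byte shorter than movq)
    movhps  [out1], xmm0       ; store the high quadword
    ret
```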
In all cases AFAIK, no CPUs care about `float` vs. `double` for anything other than actual add/multiply instructions; there aren't CPUs with separate `float` and `double` bypass-forwarding domains. The ISA design leaves that option open, but in practice there's never a penalty for saving a byte by using `movups` or `movaps` to copy around a vector of `double`, or for using `movlps` instead of `movlpd`. `double` shuffles are sometimes useful, though, because `unpcklpd` is like `punpcklqdq` (interleave 64-bit elements), while `unpcklps` is like `punpckldq` (interleave 32-bit elements).
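As a small illustration of that last point, here's a hedged sketch of building the same 128-bit value from two separate registers with `punpcklqdq` instead of loading it straight from memory (the function name and the System V argument registers are my assumptions):

```nasm
default rel

section .text
global combine_qwords
; Assumed System V x86-64 arguments: rdi = &low_qword, rsi = &high_qword
combine_qwords:
    movq       xmm0, [rdi]     ; xmm0 = [ 0    | low  ]
    movq       xmm1, [rsi]     ; xmm1 = [ 0    | high ]
    punpcklqdq xmm0, xmm1      ; xmm0 = [ high | low  ]: interleave the low 64-bit halves
    ; unpcklpd xmm0, xmm1 would do the same data movement in the FP domain
    ret
```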