Given a 128-bit xmm register that is packed with two quadwords (i.e. two 64-bit integers):
╭──────────────────┬──────────────────╮
xmm0 │ ffeeddccbbaa9988 │ 7766554433221100 │
╰──────────────────┴──────────────────╯
How can I perform a rotate on the individual quadwords? For example:
prorqw xmm0, 32 // rotate right packed quadwords
╭──────────────────┬──────────────────╮
xmm0 │ bbaa9988ffeeddcc │ 3322110077665544 │
╰──────────────────┴──────────────────╯
I know SSE2 provides:
PSHUFW: shuffle packed words (16-bit)
PSHUFD: shuffle packed doublewords (32-bit)
Although I don't know exactly what those instructions do, and there doesn't seem to be a quadword (64-bit) version.
How would you perform a ROR of an xmm register, assuming packed data of other sizes?
Rotate Right Packed doublewords by 16-bits:
╭──────────┬──────────┬──────────┬──────────╮
xmm0 │ ffeeddcc │ bbaa9988 │ 77665544 │ 33221100 │
╰──────────┴──────────┴──────────┴──────────╯
⇓
╭──────────┬──────────┬──────────┬──────────╮
xmm0 │ ddccffee │ 9988bbaa │ 55447766 │ 11003322 │
╰──────────┴──────────┴──────────┴──────────╯
Rotate Right Packed Words by 8-bits:
╭──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────╮
xmm0 │ ffee │ ddcc │ bbaa │ 9988 │ 7766 │ 5544 │ 3322 │ 1100 │
╰──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────╯
⇓
╭──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────╮
xmm0 │ eeff │ ccdd │ aabb │ 8899 │ 6677 │ 4455 │ 2233 │ 0011 │
╰──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────╯
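In scalar code the operation I want is just an ordinary 64-bit rotate applied to each lane independently. As a reference for the diagrams above (the function name rotr64 is only illustrative, not an existing API):

```c
#include <stdint.h>

/* Scalar reference: rotate one 64-bit value right by n bits.
   Masking n avoids undefined behaviour for n == 0. */
static uint64_t rotr64(uint64_t x, unsigned n)
{
    n &= 63;
    return (x >> n) | (x << ((64 - n) & 63));
}

/* rotr64(0xffeeddccbbaa9988, 32) == 0xbbaa9988ffeeddcc, matching the
   first diagram; the packed versions below just do this per lane. */
```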
How would you perform the above if it were a 256-bit ymm register?
╭──────────────────────────────────┬──────────────────────────────────╮
ymm0 │ 2f2e2d2c2b2a29282726252423222120 │ ffeeddccbbaa99887766554433221100 │ packed double quadwords
╰──────────────────────────────────┴──────────────────────────────────╯
╭──────────────────┬──────────────────┬──────────────────┬──────────────────╮
ymm0 │ 2f2e2d2c2b2a2928 │ 2726252423222120 │ ffeeddccbbaa9988 │ 7766554433221100 │ packed quadwords
╰──────────────────┴──────────────────┴──────────────────┴──────────────────╯
╭──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────╮
ymm0 │ 2f2e2d2c │ 2b2a2928 │ 27262524 │ 23222120 │ ffeeddcc │ bbaa9988 │ 77665544 │ 33221100 │ packed doublewords
╰──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────╯
╭──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────╮
ymm0 │ 2f2e │ 2d2c │ 2b2a │ 2928 │ 2726 │ 2524 │ 2322 │ 2120 │ ffee │ ddcc │ bbaa │ 9988 │ 7766 │ 5544 │ 3322 │ 1100 │ packed words
╰──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────╯
If the rotate count is a multiple of 8, you can use byte shuffles. SSSE3 pshufb with the right control mask can handle any multiple of 8 in one instruction.
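For example, a rotate right by 8 within each qword: every destination byte takes source byte (i+1) mod 8 of its own qword. A sketch with intrinsics (the helper name is my own; the target attribute is just so the SSSE3 intrinsic compiles without -mssse3):

```c
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

/* Rotate each 64-bit lane right by 8 bits with a single pshufb. */
__attribute__((target("ssse3")))
static __m128i rotr64x2_by8(__m128i v)
{
    /* Byte i of the result = source byte ctrl[i]; indices stay inside
       the same qword, rotated by one byte position. */
    const __m128i ctrl = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 0,
                                       9, 10, 11, 12, 13, 14, 15, 8);
    return _mm_shuffle_epi8(v, ctrl);
}
```

Any other multiple-of-8 count is the same idea with a different control vector.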
SSE2 pshufd can handle count=32, swapping the two halves of each qword: _MM_SHUFFLE(2,3, 0,1), or in asm pshufd xmm0, xmm0, 0b10_11_00_01. (NASM supports _ as an optional separator in numeric literals, like C++14.)
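With intrinsics that's one shuffle (the wrapper name here is only illustrative):

```c
#include <emmintrin.h>   /* SSE2 */

/* Rotate both 64-bit lanes by 32 bits: swap the two dword halves of each
   qword.  For a 32-of-64 rotate, left and right are the same operation. */
static __m128i rot64x2_by32(__m128i v)
{
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1));
}
```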
SSE2 pshuflw + pshufhw is not bad for the multiple-of-16 counts in a version of your function without SSSE3, but you need separate shuffles for the low and high qwords, because an imm8 control byte only holds four 2-bit fields. (Or, with AVX2, for the odd/even qwords within each lane.)
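A sketch of that pair for a rotate right by 16 (the helper name is my own): pshuflw permutes words 0-3 and leaves the high qword alone, pshufhw does the opposite, and both use the same control because each qword rotates the same way.

```c
#include <emmintrin.h>   /* SSE2 */

/* Rotate each 64-bit lane right by 16 bits: destination word i takes
   source word (i+1) mod 4 of its own qword. */
static __m128i rotr64x2_by16(__m128i v)
{
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 3, 2, 1));   /* low qword  */
    return _mm_shufflehi_epi16(v, _MM_SHUFFLE(0, 3, 2, 1)); /* high qword */
}
```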
If the rotate count is not a multiple of 8, there's AVX512F vprolq zmm0, zmm1, 13 and vprorq. They're also available in variable-count versions, vprolvq / vprorvq, which take per-element counts from another vector instead of an immediate. Both also exist at dword granularity, but not word or byte.
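Since not everyone has AVX512 hardware to try it on, here's a scalar model of what vprorvq computes per lane (my reading of Intel's description: each lane uses its own count, taken mod 64):

```c
#include <stdint.h>

/* Scalar model of vprorvq: rotate each 64-bit element right by the count
   in the corresponding element of cnt. */
static void prorvq_model(uint64_t dst[], const uint64_t src[],
                         const uint64_t cnt[], int lanes)
{
    for (int i = 0; i < lanes; i++) {
        unsigned n = (unsigned)(cnt[i] & 63);   /* counts are mod 64 */
        dst[i] = (src[i] >> n) | (src[i] << ((64 - n) & 63));
    }
}
```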
Otherwise, with only SSE2 and a count that's not a multiple of 16, you need left shift + right shift + OR to implement in asm the common C idiom for a rotate, (x << n) | (x >> (64-n)). (Best practices for circular shift (rotate) operations in C++ points out ways to work around the potential C UB from out-of-range shift counts; that isn't a problem with intrinsics or asm, because their behaviour is well-defined by Intel: SIMD shifts saturate the shift count instead of masking it like scalar shifts do.)
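The shift+shift+OR version in intrinsics (again, the function name is only illustrative; n must be in 1..63 here since this sketch doesn't mask the count):

```c
#include <emmintrin.h>   /* SSE2 */

/* Rotate each 64-bit lane right by n (1..63): two shifts and an OR, the
   SIMD counterpart of (x >> n) | (x << (64-n)). */
static __m128i rotr64x2(__m128i v, int n)
{
    return _mm_or_si128(_mm_srli_epi64(v, n),
                        _mm_slli_epi64(v, 64 - n));
}
```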
SSE2 has shifts with granularity as small as 16-bit, so you can do that directly.
For byte granularity, you'd need extra masking to zero out bits that shifted between bytes within a word; see Efficient way of rotating a byte inside an AVX register. Or use tricks like pmullw with a vector of power-of-2 elements, allowing variable counts per element (where AVX2 normally only has variable-count shifts at dword/qword granularity).
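The multiply trick, sketched at word granularity (helper name is my own): multiplying a 16-bit lane by 2^n puts x << n in the low half of the 32-bit product (pmullw) and x >> (16-n) in the high half (pmulhuw), so OR-ing the two halves gives a left-rotate by n, with a different n per lane encoded in the multiplier vector.

```c
#include <emmintrin.h>   /* SSE2 */

/* Variable per-lane left-rotate of 16-bit words.
   pow2 lane i must hold 1 << n_i, where n_i is that lane's rotate count. */
static __m128i rotl16x8_var(__m128i v, __m128i pow2)
{
    __m128i lo = _mm_mullo_epi16(v, pow2);   /* (x << n) & 0xffff */
    __m128i hi = _mm_mulhi_epu16(v, pow2);   /*  x >> (16 - n)    */
    return _mm_or_si128(lo, hi);
}
```

This also handles n = 0 correctly: the high half of the product is zero and the low half is x unchanged.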