Given a 128-bit xmm register that is packed with two quadwords (i.e. two 64-bit integers):
╭──────────────────┬──────────────────╮
xmm0 │ ffeeddccbbaa9988 │ 7766554433221100 │
╰──────────────────┴──────────────────╯
How can I perform a rotate on the individual quadwords? For example:
prorqw xmm0, 32 // rotate right packed quadwords
╭──────────────────┬──────────────────╮
xmm0 │ bbaa9988ffeeddcc │ 3322110077665544 │
╰──────────────────┴──────────────────╯
I know SSE2 provides:
PSHUFW: shuffle packed words (16-bit)
PSHUFD: shuffle packed doublewords (32-bit)
Although I don't know exactly what those instructions do, and there doesn't seem to be a quadword (64-bit) version.
How would you perform a ROR of an xmm register, assuming packed data of other sizes?
Rotate Right Packed doublewords by 16-bits:
╭──────────┬──────────┬──────────┬──────────╮
xmm0 │ ffeeddcc │ bbaa9988 │ 77665544 │ 33221100 │
╰──────────┴──────────┴──────────┴──────────╯
⇓
╭──────────┬──────────┬──────────┬──────────╮
xmm0 │ ddccffee │ 9988bbaa │ 55447766 │ 11003322 │
╰──────────┴──────────┴──────────┴──────────╯
Rotate Right Packed Words by 8-bits:
╭──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────╮
xmm0 │ ffee │ ddcc │ bbaa │ 9988 │ 7766 │ 5544 │ 3322 │ 1100 │
╰──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────╯
⇓
╭──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────╮
xmm0 │ eeff │ ccdd │ aabb │ 8899 │ 6677 │ 4455 │ 2233 │ 0011 │
╰──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────╯
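In scalar code the operation I want is just an ordinary 64-bit rotate applied to each lane independently. As a reference for the diagrams above (the function name rotr64 is only illustrative, not an existing API):

```c
#include <stdint.h>

/* Scalar reference: rotate one 64-bit value right by n bits.
   Masking n avoids undefined behaviour for n == 0. */
static uint64_t rotr64(uint64_t x, unsigned n)
{
    n &= 63;
    return (x >> n) | (x << ((64 - n) & 63));
}

/* rotr64(0xffeeddccbbaa9988, 32) == 0xbbaa9988ffeeddcc, matching the
   first diagram; the packed versions below just do this per lane. */
```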
How would you perform the above if it were a 256-bit ymm register?
╭──────────────────────────────────┬──────────────────────────────────╮
ymm0 │ 2f2e2d2c2b2a29282726252423222120 │ ffeeddccbbaa99887766554433221100 │ packed double quadwords
╰──────────────────────────────────┴──────────────────────────────────╯
╭──────────────────┬──────────────────┬──────────────────┬──────────────────╮
ymm0 │ 2f2e2d2c2b2a2928 │ 2726252423222120 │ ffeeddccbbaa9988 │ 7766554433221100 │ packed quadwords
╰──────────────────┴──────────────────┴──────────────────┴──────────────────╯
╭──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────╮
ymm0 │ 2f2e2d2c │ 2b2a2928 │ 27262524 │ 23222120 │ ffeeddcc │ bbaa9988 │ 77665544 │ 33221100 │ packed doublewords
╰──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────╯
╭──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────╮
ymm0 │ 2f2e │ 2d2c │ 2b2a │ 2928 │ 2726 │ 2524 │ 2322 │ 2120 │ ffee │ ddcc │ bbaa │ 9988 │ 7766 │ 5544 │ 3322 │ 1100 │ packed words
╰──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────╯
If the rotate count is a multiple of 8, you can use byte shuffles. SSSE3 pshufb with the right control mask can handle any multiple of 8 in one instruction.
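For example, a rotate right by 8 within each qword: every destination byte takes source byte (i+1) mod 8 of its own qword. A sketch with intrinsics (the helper name is my own; the target attribute is just so the SSSE3 intrinsic compiles without -mssse3):

```c
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

/* Rotate each 64-bit lane right by 8 bits with a single pshufb. */
__attribute__((target("ssse3")))
static __m128i rotr64x2_by8(__m128i v)
{
    /* Byte i of the result = source byte ctrl[i]; indices stay inside
       the same qword, rotated by one byte position. */
    const __m128i ctrl = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 0,
                                       9, 10, 11, 12, 13, 14, 15, 8);
    return _mm_shuffle_epi8(v, ctrl);
}
```

Any other multiple-of-8 count is the same idea with a different control vector.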
SSE2 pshufd can handle count=32, swapping the two halves of each qword: _MM_SHUFFLE(2,3, 0,1), or in asm pshufd xmm0, xmm0, 0b10_11_00_01. (NASM supports _ as an optional separator in numeric literals, like C++14.)
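With intrinsics that's one shuffle (the wrapper name here is only illustrative):

```c
#include <emmintrin.h>   /* SSE2 */

/* Rotate both 64-bit lanes by 32 bits: swap the two dword halves of each
   qword.  For a 32-of-64 rotate, left and right are the same operation. */
static __m128i rot64x2_by32(__m128i v)
{
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1));
}
```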
SSE2 pshuflw + pshufhw is not bad for the multiple-of-16 counts in a version of your function without SSSE3, but you need separate shuffles for the low and high qwords, because an imm8 control byte only holds four 2-bit fields. (Or, with AVX2, for the odd/even qwords within each lane.)
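A sketch of that pair for a rotate right by 16 (the helper name is my own): pshuflw permutes words 0-3 and leaves the high qword alone, pshufhw does the opposite, and both use the same control because each qword rotates the same way.

```c
#include <emmintrin.h>   /* SSE2 */

/* Rotate each 64-bit lane right by 16 bits: destination word i takes
   source word (i+1) mod 4 of its own qword. */
static __m128i rotr64x2_by16(__m128i v)
{
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 3, 2, 1));   /* low qword  */
    return _mm_shufflehi_epi16(v, _MM_SHUFFLE(0, 3, 2, 1)); /* high qword */
}
```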
If the rotate count is not a multiple of 8, there's AVX512F vprolq zmm0, zmm1, 13 and vprorq. They're also available in variable-count versions, vprolvq / vprorvq, which take per-element counts from another vector instead of an immediate. Both also exist at dword granularity, but not word or byte.
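Since not everyone has AVX512 hardware to try it on, here's a scalar model of what vprorvq computes per lane (my reading of Intel's description: each lane uses its own count, taken mod 64):

```c
#include <stdint.h>

/* Scalar model of vprorvq: rotate each 64-bit element right by the count
   in the corresponding element of cnt. */
static void prorvq_model(uint64_t dst[], const uint64_t src[],
                         const uint64_t cnt[], int lanes)
{
    for (int i = 0; i < lanes; i++) {
        unsigned n = (unsigned)(cnt[i] & 63);   /* counts are mod 64 */
        dst[i] = (src[i] >> n) | (src[i] << ((64 - n) & 63));
    }
}
```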
Otherwise, with only SSE2 and a count that's not a multiple of 16, you need left shift + right shift + OR to implement in asm the common C idiom for a rotate, (x << n) | (x >> (64-n)). (Best practices for circular shift (rotate) operations in C++ points out ways to work around the potential C UB from out-of-range shift counts; that isn't a problem with intrinsics or asm, because their behaviour is well-defined by Intel: SIMD shifts saturate the shift count instead of masking it like scalar shifts do.)
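The shift+shift+OR version in intrinsics (again, the function name is only illustrative; n must be in 1..63 here since this sketch doesn't mask the count):

```c
#include <emmintrin.h>   /* SSE2 */

/* Rotate each 64-bit lane right by n (1..63): two shifts and an OR, the
   SIMD counterpart of (x >> n) | (x << (64-n)). */
static __m128i rotr64x2(__m128i v, int n)
{
    return _mm_or_si128(_mm_srli_epi64(v, n),
                        _mm_slli_epi64(v, 64 - n));
}
```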
SSE2 has shifts with granularity as small as 16-bit, so you can do that directly.
For byte granularity, you'd need extra masking to zero out bits that shifted between bytes within a word; see Efficient way of rotating a byte inside an AVX register. Or use tricks like pmullw with a vector of power-of-2 elements, allowing variable counts per element (where AVX2 normally only has variable-count shifts at dword/qword granularity).
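The multiply trick, sketched at word granularity (helper name is my own): multiplying a 16-bit lane by 2^n puts x << n in the low half of the 32-bit product (pmullw) and x >> (16-n) in the high half (pmulhuw), so OR-ing the two halves gives a left-rotate by n, with a different n per lane encoded in the multiplier vector.

```c
#include <emmintrin.h>   /* SSE2 */

/* Variable per-lane left-rotate of 16-bit words.
   pow2 lane i must hold 1 << n_i, where n_i is that lane's rotate count. */
static __m128i rotl16x8_var(__m128i v, __m128i pow2)
{
    __m128i lo = _mm_mullo_epi16(v, pow2);   /* (x << n) & 0xffff */
    __m128i hi = _mm_mulhi_epu16(v, pow2);   /*  x >> (16 - n)    */
    return _mm_or_si128(lo, hi);
}
```

This also handles n = 0 correctly: the high half of the product is zero and the low half is x unchanged.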