Search code examples
x86intelsimdnamingavx

What is the difference between shuffle and permute


In x86-64 SIMD instruction names, as well as the intrinsic functions you can use to access them from C/C++, you find both the terms shuffle (e.g., _mm_shuffle_epi32) and permute (e.g., _mm_permute_pd).

Superficially, they seem to both be used for data movement stuff. What's the difference?


Solution

  • I haven't looked for inspiration outside of x86. I don't think there's any standard convention here.


    I think they just switched from "shuffle" to "permute" at some point in time between SSSE3 pshufb and AVX1 vpermilps/pd / vperm2f128. Everything before AVX is called "shuffle", everything after is called "permute".

    (SSE4.x didn't introduce any instructions named "shuffle" or "permute", just pinsrd / pextrd and other operand sizes being the main shuffles that SSE4.1 added)


    There are 2 exceptions to that, not counting VEX / EVEX encodings of vshufps, vpshufd, etc.:

    AVX512F VSHUFF32X4 (and 64x2 and integer versions) 128-bit granularity lane shuffles with immediate control have the same design as vshufps: low half of the destination selects elements from the first source, high half selects from the 2nd source. e.g. _mm512_shuffle_i64x2(__m512i a, __m512i b, int imm); This naming is helpful to remember how the shuffle control works. With 4 output lanes, there's only room for 4x 2-bit selectors, not 4x 3-bit. The 256-bit operand-size version still has the same limitation so it only uses the low 2 bits of the immediate, like shufpd.

    AVX512BITALG VPSHUFBITQMB is like vpmultishiftqb (parallel bitfield extract) + vector->mask (like a movemask). So it can select any 8 bits within each qword chunk of the input.

    AVX512 256-bit granularity operations currently only exist with names like VEXTRACTF32x8 and VINSERTF32x8, not shuf or perm.


    Intrinsic name do match instruction mnemonics as far as shuffle vs. permute, but otherwise can leave out "in lane" when the mnemonic has it, requiring the lane-crossing version to be different too. (e.g. AVX1 vpermilps = _mm_permute_ps imm8 or _mm_permutevar_ps __m128i control vs. AVX2 vpermps = _mm256_permutexvar_ps; not available with immediate control, but vpermpd is.

    Intel's intrinsics guide only lists _mm256_permutevar8x32_ps for the AVX2 form of vpermps, while the ISA ref manual only lists permutexvar. (update: which later versions of the intrinsics guide list as AVX-512; GCC agrees with that, refusing to compile it without AVX-512 enabled). Anyway, weird choice, 8x32 sounds like an AVX512 instruction (with per-element masking). They did need a way to distinguish it from AVX1 vpermilps where they used up the obvious name, _mm256_permute_ps and _mm256_permutevar_ps.


    There are no other pattern I've noticed. We can easily rule out all of the following hypotheses:

    • copy-and-shuffle (pshufd xmm, xmm/mem, imm) vs. in-place shuffle (pshufb data, idx or shufps xmm, xmm, imm)
    • immediate control vs. variable-control (pshufd vs. pshufb or AVX2 vpermd vs. vperm2i128)
    • integer vs. FP (SSE2 pshufd vs. shufps/pd ; AVX2 vpermps vs. vpermd)
    • 1-source vs. 2-source (pshufd vs. shufps ; AVX2 vpermd vs. AVX512 vpermt2d)
    • lane-crossing vs. in-lane (AVX1 vpermilps vs. AVX2 vpermps)

    The shuffle-control immediate works the same way in pshufd and vpermq-immediate. But unlike the "tricky" vshuff32x4 case, both pshufd and vpermq work the obvious way so no need to make an analogy to another mnemonic. Also, "pshuf" is a bit awkward vs. "shuf" or "perm", so I can see why they'd want something else for packed-integer.

    Note that the "shuf" names go all the way back to SSE1 shufps, introduced by Pentium III (Katmai) at the same time as MMX2 pshufw mm, mm, imm8.

    P5 Pentium MMX didn't have any instructions named shuf/perm instructions, just the punpckl/h shuffles in various sizes.

    https://nasm.us/doc/nasmdocb.html#section-B.1.7 (That NASM appendix is helpful because it sorts mnemonics into groups by order of introduction. It's what got me to notice the vshuff32x4 mnemonic down in the AVX512 stuff after I thought they'd switched to calling everything "perm".)