I'm aware of byte shuffling instructions, but I'd like to do the same with nibbles (4-bit values). Concretely, I'd like to shuffle the 16 nibbles of a 64-bit word. My shuffle indices are also stored as 16 nibbles. What's the most efficient implementation of this?
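To pin down the semantics, here's a plain scalar version of what I mean (assuming nibble 0 is the least-significant nibble of both the data and the control, i.e. control nibble i selects which source nibble ends up in result nibble i):

#include <stdint.h>

uint64_t shuffle_nibbles_scalar(uint64_t data, uint64_t control)
{
    uint64_t out = 0;
    for (int i = 0; i < 16; i++) {
        unsigned idx = (control >> (4 * i)) & 0xf;        // which source nibble to take
        out |= ((data >> (4 * idx)) & 0xf) << (4 * i);    // place it in result nibble i
    }
    return out;
}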
Arbitrary shuffles with a control vector that has to be stored this way? Ugh, hard to work with. I guess you'd have to unpack both to feed SSSE3 pshufb, and then re-pack that result. Probably just punpcklbw against a right-shifted copy, then AND mask to keep only the low 4 bits in each byte. Then pshufb.
Sometimes an odd/even split is easier than widening each element (so bits just stay within their original byte or word). In this case, if we could change your nibble index numbering, punpcklqdq could put the odd or even nibbles in the high half, ready to bring them back down and OR.
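A minimal sketch of that split idea (untested, SSE2 + SSSE3 only). It assumes the changed numbering: nibble j of the input lives at byte (j>>1) + 8*(j&1) of the split layout, and the control has to be supplied as 16 byte indices already expressed in that layout, which is a hypothetical interface rather than your nibble-control format:

#include <stdint.h>
#include <immintrin.h>

uint64_t shuffle_nibbles_split(uint64_t data, __m128i split_control)
{
    __m128i vd = _mm_cvtsi64_si128(data);
    __m128i lo = _mm_and_si128(vd, _mm_set1_epi8(0x0f));                     // even nibbles in bytes 0..7
    __m128i hi = _mm_and_si128(_mm_srli_epi32(vd, 4), _mm_set1_epi8(0x0f));  // odd nibbles in bytes 0..7
    __m128i split = _mm_unpacklo_epi64(lo, hi);       // punpcklqdq: evens in the low half, odds in the high half
    __m128i res = _mm_shuffle_epi8(split, split_control);
    __m128i odd = _mm_unpackhi_epi64(res, res);       // bring the odd-position results back down
    res = _mm_or_si128(res, _mm_slli_epi32(odd, 4));  // every byte is <= 0x0f, so <<4 can't spill across bytes
    return _mm_cvtsi128_si64(res);
}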
But without doing that, re-packing is a separate problem. I guess combine adjacent pairs of bytes into the low byte of each word, perhaps with pmaddubsw if throughput is more important than latency. Then you can packuswb (against zero or itself) or pshufb (with a constant control vector).
If you were doing multiple such shuffles, you could pack two vectors down to one, to store with movhps / movq. Using AVX2, it might be possible to have all the other instructions working on two independent shuffles in the two 128-bit lanes.
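For instance, here's an untested sketch of that two-at-a-time idea with AVX2, doing the same unpack / shuffle / re-pack as the SSSE3 version below in each 128-bit lane. The two-input, two-output interface is my own invention for illustration:

#include <stdint.h>
#include <immintrin.h>

// Two independent 64-bit nibble shuffles at once, one per 128-bit lane (untested).
void shuffle_nibbles_x2(const uint64_t data[2], const uint64_t control[2], uint64_t out[2])
{
    __m256i vd = _mm256_setr_epi64x(data[0], 0, data[1], 0);       // one input per lane
    __m256i vc = _mm256_setr_epi64x(control[0], 0, control[1], 0);
    const __m256i lownib = _mm256_set1_epi8(0x0f);
    vd = _mm256_and_si256(_mm256_unpacklo_epi8(vd, _mm256_srli_epi32(vd, 4)), lownib);
    vc = _mm256_and_si256(_mm256_unpacklo_epi8(vc, _mm256_srli_epi32(vc, 4)), lownib);
    vd = _mm256_shuffle_epi8(vd, vc);                              // vpshufb is per-lane, which is what we want here
    vd = _mm256_maddubs_epi16(vd, _mm256_set1_epi16(0x1001));      // merge nibble pairs into bytes
    vd = _mm256_packus_epi16(vd, vd);                              // per-lane pack of words down to bytes
    out[0] = _mm256_extract_epi64(vd, 0);                          // low qword of lane 0
    out[1] = _mm256_extract_epi64(vd, 2);                          // low qword of lane 1
}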
// UNTESTED, requires only SSSE3
#include <stdint.h>
#include <immintrin.h>
uint64_t shuffle_nibbles(uint64_t data, uint64_t control)
{
    __m128i vd = _mm_cvtsi64_si128(data); // movq
    __m128i vd_hi = _mm_srli_epi32(vd, 4); // x86 doesn't have a SIMD byte shift
    vd = _mm_unpacklo_epi8(vd, vd_hi); // every nibble at the bottom of a byte, with high garbage
    vd = _mm_and_si128(vd, _mm_set1_epi8(0x0f)); // clear high garbage for later merging

    __m128i vc = _mm_cvtsi64_si128(control);
    __m128i vc_hi = _mm_srli_epi32(vc, 4);
    vc = _mm_unpacklo_epi8(vc, vc_hi);
    vc = _mm_and_si128(vc, _mm_set1_epi8(0x0f)); // make sure high bit is clear, else pshufb zeros that element.
    // AVX-512VBMI vpermb doesn't have that problem, if you have it available

    vd = _mm_shuffle_epi8(vd, vc);

    // left-hand input is the unsigned one, right hand is treated as signed bytes.
    vd = _mm_maddubs_epi16(vd, _mm_set1_epi16(0x1001)); // hi nibbles << 4 (*= 0x10), lo nibbles *= 1.
    // vd has nibbles merged into bytes, but interleaved with zero bytes

    vd = _mm_packus_epi16(vd, vd); // duplicate vd into low & high halves.
    // Pack against _mm_setzero_si128() if you're not just going to movq into memory or a GPR and you want the high half of the vector to be zero.
    return _mm_cvtsi128_si64(vd);
}
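For what it's worth, a quick way to sanity-check it (a hypothetical driver I'm adding for illustration, untested like the function itself): the identity control 0xfedcba9876543210 should return the data unchanged, and 0x0123456789abcdef should reverse the nibble order.

#include <stdio.h>

int main(void)   // in the same file as shuffle_nibbles() above
{
    uint64_t data = 0x123456789abcdef0;
    printf("%016llx\n", (unsigned long long)shuffle_nibbles(data, 0xfedcba9876543210)); // expect 123456789abcdef0
    printf("%016llx\n", (unsigned long long)shuffle_nibbles(data, 0x0123456789abcdef)); // expect 0fedcba987654321
    return 0;
}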
Masking the data with 0x0f ahead of the shuffle (instead of after) allows more ILP on CPUs with two shuffle units, at least if the uint64_t values were already in vector registers, or if the data and control values are coming from memory so both can be loaded in the same cycle. If they're coming from GPRs, the 1/clock throughput of vmovq xmm, reg means there's a resource conflict between the dep chains, so they can't both start in the same cycle. But since the data might be ready before the control, masking early keeps it off the critical path for control->output latency.
If latency is a bottleneck instead of the usual throughput, consider replacing pmaddubsw with a right-shift by 4, por, and AND/pack. Or pshufb to pack while ignoring garbage in odd bytes. Since you'd need another constant anyway, might as well make it a pshufb constant instead of an and constant.
If you had AVX-512, a shift and bit-blend with vpternlogd could avoid needing to mask the data before shuffling, and vpermb instead of vpshufb would avoid needing to mask the control, so you'd avoid the set1_epi8(0x0f) constant entirely.
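As a partial illustration (untested, needs AVX-512VBMI + AVX-512VL), here's just the vpermb half of that: the control no longer needs masking because vpermb uses only the index bits it needs instead of zeroing the element like pshufb does. The data is still masked here so the pmaddubsw merge keeps working; the vpternlogd trick for the data side isn't shown.

uint64_t shuffle_nibbles_vbmi(uint64_t data, uint64_t control)
{
    __m128i vd = _mm_cvtsi64_si128(data);
    vd = _mm_unpacklo_epi8(vd, _mm_srli_epi32(vd, 4));
    vd = _mm_and_si128(vd, _mm_set1_epi8(0x0f));        // still needed so the merge below sees clean bytes
    __m128i vc = _mm_cvtsi64_si128(control);
    vc = _mm_unpacklo_epi8(vc, _mm_srli_epi32(vc, 4));  // high-nibble garbage is fine for vpermb
    vd = _mm_permutexvar_epi8(vc, vd);                  // vpermb: index operand comes first
    vd = _mm_maddubs_epi16(vd, _mm_set1_epi16(0x1001));
    vd = _mm_packus_epi16(vd, vd);
    return _mm_cvtsi128_si64(vd);
}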
clang's shuffle optimizer didn't spot anything, just compiling it as-written like GCC does (https://godbolt.org/z/xz7TTbM1d), even with -march=sapphirerapids. It doesn't spot that it could use vpermb instead of vpand / vpshufb.
shuffle_nibbles(unsigned long, unsigned long):
vmovq xmm0, rdi
vpsrld xmm1, xmm0, 4
vpunpcklbw xmm0, xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
vmovq xmm1, rsi
vpsrld xmm2, xmm1, 4
vpunpcklbw xmm1, xmm1, xmm2 # xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
vmovdqa xmm2, xmmword ptr [rip + .LCPI0_0] # xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
vpand xmm0, xmm0, xmm2
vpand xmm1, xmm1, xmm2
vpshufb xmm0, xmm0, xmm1
vpmaddubsw xmm0, xmm0, xmmword ptr [rip + .LCPI0_1]
vpackuswb xmm0, xmm0, xmm0
vmovq rax, xmm0
ret
(Without AVX, it requires 2 extra movdqa register-copy instructions.)