Tags: c#, c++, avx, avx2, pixelformat

Unpack 12-bit data quickly (where the nibbles aren't contiguous; how to shuffle nibbles?)


I need to unpack 12-bit data that's stored packed: two unsigned 12-bit fields in 24 bits (3 bytes). I'd like to store the results in a byte[] in little-endian uint16 order.

The packed format is a bit odd; byte[0] is the upper 8 significant bits of the first 12-bit number and byte[2] is the upper 8 significant bits of the second 12-bit number. The middle byte[1] has the lower 4 bits of both: the first value in the lower nibble and the second value in the upper nibble.
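
For example, if the three packed bytes are 0xAB, 0xCD, 0xEF, the first value is 0xABD (0xAB from byte[0] plus the low nibble of byte[1]) and the second is 0xEFC (0xEF from byte[2] plus the high nibble of byte[1]); as little-endian uint16s the output bytes would be BD 0A FC 0E.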

Here's a visual: the full boxes are bytes, letters represent nibbles. Lower addresses on the left, so SIMD bit-shift left will actually move data to the right across boxes but left within boxes.
[image: diagram of the packed layout — boxes are bytes, letters are nibbles]

I have written 2 working versions in C#.

        private byte[] Unpack12b2(byte[] input) //slower
        {
            int eod = input.Length / 3 * 3; //protect against file sizes that aren't a multiple of 3
            byte[] output = new byte[eod / 3 * 4]; // each 3 input bytes produce 4 output bytes (2 little-endian uint16)
            int j = 0;
            int loop = 0;

            for (int i = 0; i < eod; i+=3)
            {
                j = i + loop++;
                output[j] = (byte)((input[i] << 4) | (input[i + 1] & 0xf));
                output[j + 1] = (byte)(input[i] >> 4);
                output[j + 2] = (byte)((input[i + 2] << 4) | (input[i + 1] >> 4));
                output[j + 3] = (byte)(input[i + 2] >> 4);
            }
            return output;
        }

        private ushort[] Unpack12b(byte[] input) //slightly faster
        {
            int outputIndex = 0;
            int byte1, byte2, byte3;

            int eod = input.Length / 3 * 3; //protect against file sizes that aren't a multiple of 3
            ushort[] output = new ushort[eod / 3 * 2];

            for (int i = 0; i < eod; i += 3)
            {
                byte1 = input[i];
                byte2 = input[i + 1];
                byte3 = input[i + 2];

                output[outputIndex++] = (ushort)((byte1 << 4) | (byte2 & 0xf));
                output[outputIndex++] = (ushort)((byte3 << 4) | (byte2 >> 4));
            }
            return output;
        }

The closest existing answer I've found is SIMD unpack 12-bit fields to 16-bit, but the packed format in that question is much easier to deal with.

I'd really like to speed this up. The output is an array of 200+ million bytes, so it's a lot of loop iterations, and this function is called repeatedly.

Any thoughts on how to speed this up would be appreciated. Ideally I'd like to implement something with AVX2 in C++, but I'm getting lost in how to shuffle in nibbles rather than bytes.


Solution

  • If you can't use C# System.Runtime.Intrinsics.X86 to do this in C#, then yes, calling out to a function created by a C++ compiler could be good. To balance marshalling overhead against cache misses, you might want to work in chunks of 64K input producing 85K output, which you read while it's still hot in L2 cache before moving on to the next chunk. But if you need random access to the unpacked output, you might be stuck with doing the whole thing at once and getting at best L3 cache hits, or even misses all the way to DRAM.
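
    A rough cache-blocking sketch of that idea (my illustration, not code from the answer; unpack12_buffer stands in for the per-chunk unpack kernel, one possible shape of which is sketched further below, and consume_output is a hypothetical stand-in for whatever reads the unpacked data):

    #include <stdint.h>
    #include <stddef.h>
    #include <algorithm>

    void unpack12_buffer(const uint8_t *src, uint16_t *dst, size_t n_in);  // per-chunk unpack kernel
    void consume_output(const uint16_t *vals, size_t count);               // hypothetical consumer

    void unpack_in_chunks(const uint8_t *src, size_t n_in)
    {
        // ~64 KiB of input per chunk, rounded down to a whole number of 3-byte groups
        constexpr size_t CHUNK = (64 * 1024) / 3 * 3;
        static uint16_t buf[CHUNK / 3 * 2];   // ~85 KiB of output, reused so it stays hot in L2 (not thread-safe as written)
        for (size_t pos = 0; pos < n_in; pos += CHUNK) {
            size_t n = std::min(CHUNK, n_in - pos) / 3 * 3;   // ignore a trailing partial group
            unpack12_buffer(src + pos, buf, n);
            consume_output(buf, n / 3 * 2);   // read it back while it's still in cache
        }
    }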


    Most of the techniques from SIMD unpack 12-bit fields to 16-bit are applicable, like positioning an unaligned 32-byte load so the lane boundary falls between two 12-byte halves of the data, setting up for a vpshufb which will get the 3 bytes we want into each 4-byte chunk.

    Moving nibbles around requires bit-shifts (or AVX-512 instructions like vpmultishiftqb, a bitfield-extract). x86 SIMD only has 16-bit and wider shifts, not 8-bit element size, unfortunately. (Left-shift by 1 bit can be done by adding to itself, otherwise you can use a wider shift and AND with a mask to clear any bits that got shifted in across byte boundaries.)
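
    The masking trick for emulating a byte-granularity shift looks something like this (a sketch with my own helper names, not real intrinsics and not a step of the final version):

    #include <immintrin.h>

    static inline __m256i srli_epi8(__m256i v, int n) {   // per-byte logical right shift
        __m256i keep = _mm256_set1_epi8((char)(0xFF >> n));           // clear bits shifted in from the byte above
        return _mm256_and_si256(_mm256_srli_epi16(v, n), keep);
    }
    static inline __m256i slli_epi8(__m256i v, int n) {   // per-byte left shift
        __m256i keep = _mm256_set1_epi8((char)((0xFF << n) & 0xFF));  // clear bits shifted in from the byte below
        return _mm256_and_si256(_mm256_slli_epi16(v, n), keep);
    }
    // and a left shift by exactly 1 bit can simply be _mm256_add_epi8(v, v)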

    Visualizing how a left or right shift will move bits between bytes is vastly easier with a diagram where shift-left is left, which one might call "big endian": highest byte on the left, like Intel uses in their manuals for SIMD shuffle instructions (e.g. punpcklbw). It's also the convention in the comments of my answer on the linked 12-to-16 unpack question, where the normal bit layout makes the data look like the two contiguous 12-bit fields it actually is (unlike here, where the layout is indeed weird). I'd normally use letters starting with A, but picked different ones to avoid confusion with your diagram. (I used earlier letters for more-significant nibbles in this case; I sometimes go the other way around, to match up with the element numbering for shuffles, where 0 is the right-most element that gets loaded/stored from the lowest address, so D C B A in a diagram makes sense.)

    high       low address
    QR    ST    UV        # input format
    
    QR ST  |  ST UV       # after a byte-shuffle replicates the middle byte while expanding
    
    0Q RS  |  0U VT       # desired output.
    
     (QR<<4) | (ST>>4) in the high half.   Or QRSx>>4 in terms of 16-bit ops
     (UV<<4) | (ST&0x0F) in the low half.  Or xxUV<<4 merge with (STxx>>8)&0x0F
    

    The initial byte-shuffle (vpshufb) that expanded 12 bytes to 16 (in each lane) could give us any arrangement we want within each 32-bit chunk, like ST ST UV QR or UV ST QR UV, if either of those were a useful setup for 32-bit or 16-bit shifts (plus AND/OR).

    For example, if we had ST QR in the high u16, our desired 0QRS could be obtained with a rotate-left by 4 bits of the 16-bit STQR, to bring the S down to the bottom and shift the QR part left. Then mask to clear the garbage (T) in the high nibble. But we don't have SIMD rotates until AVX-512, and even then only in 32 and 64-bit element size. And we'd need something different for the other 16-bit word.

    A rotate can be done with (x<<4) | (x>>12). But if we're emulating it anyway, we could start with two different inputs, and/or shift by amounts that don't add to 16 or 32.
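
    In intrinsics the emulated 16-bit rotate-left-by-4 would look something like this (just illustrating the identity, not a step of the final version; x holds 16-bit elements):

     __m256i rotl4 = _mm256_or_si256(_mm256_slli_epi16(x, 4),
                                     _mm256_srli_epi16(x, 12));   // per-u16 rotate left by 4 bits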

    Starting with UV ST | QR ST, 32-bit shifts (by 20 and 12) can create 00 00 | 0U VS and xQ RS | x0 00. We could OR those together without messing up the UVS or the QRS (slightly cheaper than a vpblendw on Intel), but that's not exactly what we want. Still, in general for problems like this it's worth considering 32-bit shifts.
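
    In code, that 32-bit-shift experiment would be something like the following (a sketch of the dead end, assuming v holds the UV ST | QR ST arrangement):

     __m256i low12  = _mm256_srli_epi32(v, 20);             // 00 00 | 0U VS
     __m256i high12 = _mm256_slli_epi32(v, 12);             // xQ RS | x0 00   (x = leftover T nibble)
     __m256i merged = _mm256_or_si256(low12, high12);       // xQ RS | xU VS -- not the 0Q RS | 0U VT we want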

    A simple right shift by 4 bits (_mm256_srli_epi16(v, 4)) would turn QR ST into the 0Q RS we want in the high u16 (word) of each u32 dword element. So that's ready to _mm256_blend_epi16 like in the older Q&A, if we can come up with something that generates 0UVT at the bottom of a 32-bit element.

    0UVT is trickier: neither byte order (UV ST or ST UV) has the bits we want contiguous with each other.

    But with UV ST, the right shift we want for the high half also puts the U and V nibbles in the right place, leaving just the problem of replacing the low 4 bits (S) with the T nibble. In the original initial v (before the shift), we had a copy of T there, so 3 bitwise operations can "blend" it in.

    So it turns out QR ST | UV ST is a good layout for the initial vpshufb, making the required data movement more similar between the two u16 halves.

    With just one shift and 3 bitwise operations:

     QR ST | UV ST        # after vpshufb
     0Q RS | 0U VS        # after vpsrlw by 4
    

    AND/OR between those two vectors can produce 0Q RS | 0U VT, only needing to replace the low nibble in the low word. (Otherwise keeping everything from the shift result).

     __m256i v = _mm256_shuffle_epi8(in, c);        // QR ST | UV ST
    
     __m256i shifted = _mm256_srli_epi16(v, 4);     // 0Q RS | 0U VS
     __m256i t = _mm256_and_si256   (_mm256_set1_epi32(0x0000000F), v);  // 00 00 | 00 0T
     shifted   = _mm256_andnot_si256(_mm256_set1_epi32(0x0000000F), shifted); // 0Q RS | 0U V0
     __m256i output = _mm256_or_si256(shifted, t); // 0Q RS | 0U VT
    

    Putting that into a function that loads 24 bytes and returns 32 (ready for the caller to store in a loop), borrowing the code from my answer on SIMD unpack 12-bit fields to 16-bit. I adjusted the shuffle-control vector by swapping the low 2 bytes in each 4-byte chunk vs. that answer. (setr takes args in little-endian order). That gives us UV ST in the low word of each dword while still having QR ST in the high word.

    // loads from before the first byte we actually want; beware of using at the start of a buffer
    /* static */ inline
    __m256i unpack12to16_weird_bitorder(const char *p)
    {
        __m256i v = _mm256_loadu_si256( (const __m256i*)(p-4) );
       // v= [ x H G F E | D C B A x ]   where each letter is a 3-byte pair of two 12-bit fields, and x is 4 bytes of garbage we load but ignore
    
        const __m256i bytegrouping =
            _mm256_setr_epi8(5,4, 5,6,  8,7, 8,9,  11,10, 11,12,  14,13, 14,15, // low half uses last 12B
                             1,0, 1,2,  4,3, 4,5,   7, 6,  7, 8,  10,9, 10,11); // high half uses first 12B
        v = _mm256_shuffle_epi8(v, bytegrouping);   // vpshufb
        // each 16-bit chunk has the bits it needs, but not in the right position
        // in each chunk of 8 nibbles (4 bytes): [ q r  s t | u v  s t ]
    
        __m256i shifted = _mm256_srli_epi16(v, 4);     // 0Q RS | 0U VS
        __m256i t = _mm256_and_si256   (_mm256_set1_epi32(0x0000000F), v);       // 00 00 | 00 0T
        shifted   = _mm256_andnot_si256(_mm256_set1_epi32(0x0000000F), shifted); // 0Q RS | 0U V0
        return _mm256_or_si256(shifted, t);            // 0Q RS | 0U VT
    }
    

    This is only 4 instructions after vpshufb, and three of them are cheap bitwise booleans that can run on any vector execution port on most recent Intel / AMD CPUs, even Haswell. (https://uops.info/). So one more uop than the simpler data arrangement in terms of front-end throughput. Also, only one extra vector constant beyond the vpshufb control vector.

    One AVX-512 vpternlogd could replace the three AND/ANDNOT/OR instructions, using the same constant to blend with bit granularity. (Compilers will do this for you if you compile with -march=skylake-avx512 or znver4 or whatever; Godbolt)
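
    If you did want to write it by hand, it could look something like this (a sketch requiring AVX-512F + AVX-512VL; imm8 0xCA is the bitwise-select truth table, a ? b : c):

     __m256i shifted = _mm256_srli_epi16(v, 4);                              // 0Q RS | 0U VS
     __m256i mask    = _mm256_set1_epi32(0x0000000F);
     __m256i out     = _mm256_ternarylogic_epi32(mask, v, shifted, 0xCA);    // keep v where mask is set, else shifted: 0Q RS | 0U VT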

    Blends with byte or wider granularity can use SSE4 / AVX2 vpblendvb with a control vector, which is 2 uops on Intel (1 on AMD) or only 1 for the SSE version.
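
    For completeness, here's one way the caller loop around unpack12to16_weird_bitorder might look (my sketch, not from the original answer): scalar code handles the first and last groups so the 32-byte loads, which start 4 bytes early, never touch memory outside the input buffer. This is also roughly the unpack12_buffer routine assumed in the chunking sketch above.

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    // n_in should be a multiple of 3; writes n_in/3*2 uint16_t values to dst
    void unpack12_buffer(const uint8_t *src, uint16_t *dst, size_t n_in)
    {
        size_t i = 0, o = 0;
        // scalar prologue: do the first 24 bytes so the first vector load (at src + 24 - 4) stays in bounds
        size_t head = (n_in < 48) ? n_in : 24;
        for (; i + 3 <= head; i += 3) {
            dst[o++] = (uint16_t)((src[i] << 4) | (src[i + 1] & 0xF));      // first 12-bit value
            dst[o++] = (uint16_t)((src[i + 2] << 4) | (src[i + 1] >> 4));   // second 12-bit value
        }
        // main loop: 24 input bytes -> 16 uint16 outputs per iteration.
        // Require 4 extra readable bytes past the group, because the 32-byte load starts at p-4.
        for (; i + 24 + 4 <= n_in; i += 24, o += 16) {
            __m256i out = unpack12to16_weird_bitorder((const char *)(src + i));
            _mm256_storeu_si256((__m256i *)(dst + o), out);
        }
        // scalar epilogue for whatever is left
        for (; i + 3 <= n_in; i += 3) {
            dst[o++] = (uint16_t)((src[i] << 4) | (src[i + 1] & 0xF));
            dst[o++] = (uint16_t)((src[i + 2] << 4) | (src[i + 1] >> 4));
        }
    }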


    Multiply-add instructions for moving + combining in a single uop

    One other possibility for moving nibbles around is to multiply by a power of 2, with pmaddubsw (_mm256_maddubs_epi16, which treats one input as signed, the other as unsigned, and adds horizontal pairs of bytes into 16-bit results). With the unsigned input as the data we want to combine (so it gets zero-extended to 16-bit), we can use 1<<4 = 16 as the signed multiplier.

    After masking our inputs to clear the nibble we don't want from each 16-bit word, can we do everything with one vpmaddubsw? No, because as a multiply it can only shift left. So we can't get the S from ST to the bottom of the 0QRS output we want. (And we can't instead produce QRSx and right shift, because our 8-bit multiplier constants can't hold 256.)

    We could vpblendw between QR ST | UV ST and a vpsrlw-by-4 shift of that to generate 0Q RS | UV ST... But that doesn't quite work as an input for vpmaddubsw either: the Q would need to get multiplied by 256. But 0QRS is what we already want for that element, so we can just blend after vpmaddubsw, between it and a shift, which is better for instruction-level parallelism anyway since the two can happen in parallel.

    Details on getting 0UVT from UV ST: AND to mask away the S, giving UV 0T. Then treated as two u8 integers UV and 0T, do UV*16 + 0T*1 to get UVT. So the other input for pmaddubsw for this element should be 10 01 (hex).

    This would just cost one extra instruction (an 8-bit multiply-add) vs. the version for a simpler bit order.

    ...
        v = _mm256_shuffle_epi8(v, bytegrouping);
    
        // in each chunk of 8 nibbles (4 bytes): [ q r  s t | u v  s t ]
        __m256i lo = _mm256_srli_epi16(v, 4);                                   // [ 0 q  r s | xxxx ]
        __m256i hi = _mm256_and_si256(v, _mm256_set1_epi32(0x0000'ff0f));       // [  0000 | u v  0 t ]
        hi         = _mm256_maddubs_epi16(hi, _mm256_set1_epi32(0x0000'10'01)); // [  0000 | 0 u  v t ]
    
    return _mm256_blend_epi16(lo, hi, 0b01010101);  // even (low) words from hi, odd (high) words from lo
          // nibbles in each pair of epi16: [ 0 q r s | 0 u v t ] 
    

    vpmaddubsw is a multiply, so it's not the most efficient instruction, but modern mainstream x86 cores have good throughput for it (2/clock since Skylake and Zen 3, 1/clock on Zen 2: https://uops.info/). On Intel at least, it competes for throughput with vector shifts, but Skylake and later can run those on port 0 or 1. Its latency isn't a concern: out-of-order exec hides it, and we're only doing a short chain of ops for each vector. Hopefully it doesn't waste too much power and reduce turbo frequency.

    This is strictly worse than the shift/and/andnot/or version, which uses the same number of uops but more of them are cheap, and which needs fewer vector constants for setup. I came up with this pmaddubsw version first; I'm leaving it in the answer as an example of a bit-movement technique that's sometimes useful in other problems. If we didn't need the blend at the end to treat the two u16 halves differently, the madd version could be better.

    Note that madd could also work with a QR ST | ST UV byte order: you'd just line up the 0x10 multiplier with the other byte. That's unlike 16 or 32-bit shifts, where the bits you're combining being contiguous across byte boundaries is what matters.
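
    Concretely (my sketch, not from the answer), if v2 held the QR ST | ST UV arrangement, only the mask and multiplier constants change:

     __m256i hi2 = _mm256_and_si256(v2, _mm256_set1_epi32(0x0000'0fff));        // [ 0000 | 0 t  u v ]
     hi2         = _mm256_maddubs_epi16(hi2, _mm256_set1_epi32(0x0000'01'10));  // [ 0000 | 0 u  v t ]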