Efficient transpose of 2D nibble matrix?

Given a 2D 4x8 nibble matrix, represented as a 16-byte uint8_t array. For every pair of nibbles i, j, the byte is computed as so: (j << 4) | i.

For example, given the following matrix:

    0  1  2  3  3  7  1  9 
    4  5  6  7  4  1  6  15
    8  9  10 11 3  14 6  11
    12 13 14 15 8  10 7  4

represented as:

const uint8_t matrix[] = {
    0x10, 0x32, 0x73, 0x91,
    0x54, 0x76, 0x14, 0xf6,
    0x98, 0xba, 0xe3, 0xb6,
    0xdc, 0xfe, 0xa8, 0x47,
};

the desired array array would be:

const uint8_t result[] = {
    0x40, 0xc8, 0x51, 0xd9,
    0x62, 0xea, 0x73, 0xfb,
    0x43, 0x83, 0x17, 0xae,
    0x61, 0x76, 0xf9, 0x4b,
}

How to implement a function that achieves this most efficiently? Extensions up to AVX2 are fair game.

This is my C implementation so far, based on Nibble shuffling with x64 SIMD. It splits the matrix into two 64bit inputs, unpacks the nibbles, shuffles them and re-packs them.

__m128i unpack_nibbles(__m128i src) {
    __m128i nibbles_hi = _mm_srli_epi64(src, 4);

    //Interlave high nibbles with full nibbles [0000 hi, hi lo, ...] and clear high
    __m128i unpacked = _mm_unpacklo_epi8(src, nibbles_hi);
    return _mm_and_si128(unpacked, _mm_set1_epi8(0xf));
}

void transpose_4x8_nibbles(uint8_t *src, uint8_t *dst) {
    uint8_t *src_lo = src + 0x8;

    __m128i data_hi = _mm_loadl_epi64((__m128i*)src);
    __m128i data_lo = _mm_loadl_epi64((__m128i*)src_lo);

    data_hi = unpack_nibbles(data_hi);
    data_lo = unpack_nibbles(data_lo);

    //Transpose
    __m128i transpose_mask = _mm_setr_epi8(0, 0x8, 0x1, 0x9, 0x2, 0xa, 0x3, 0xb, 0x4, 0xc, 0x5, 0xd, 0x6, 0xe, 0x7, 0xf);
    data_hi = _mm_shuffle_epi8(data_hi, transpose_mask);
    data_lo = _mm_shuffle_epi8(data_lo, transpose_mask);

    //Pack nibbles
    __m128i pack_mask = _mm_set1_epi16(0x1001);
    data_hi = _mm_maddubs_epi16(data_hi, pack_mask);  //even bytes are multiplied by 0x10, odd bytes by 0x01
    data_lo = _mm_maddubs_epi16(data_lo, pack_mask);
    
    __m128i data = _mm_packus_epi16(data_hi, data_lo);
    data = _mm_shuffle_epi8(data, transpose_mask);
    
    _mm_store_si128((__m128i*) dst, data);
}

Solution

Lets name the nibbles as follows (everything in little endian order):

X0 Y0 X1 Y1 X2 Y2 X3 Y3
Z0 W0 Z1 W1 Z2 W2 Z3 W3
X4 Y4 X5 Y5 X6 Y6 X7 Y7
Z4 W4 Z5 W5 Z6 W6 Z7 W7

After transposing we note that X nibbles stay in the low nibble, W nibbles stay in the high nibble, Y nibbles move from high to low and Z nibbles move from low to high:

X0 Z0 X4 Z4
Y0 W0 Y4 W4
X1 Z1 X5 Z5
Y1 W1 Y5 W5
X2 Z2 X6 Z6
Y2 W2 Y6 W6
X3 Z3 X7 Z7
Y3 W3 Y7 W7

That means the X and W nibbles can be placed correctly by a simple pshufb. The Z nibbles all need to be shifted up (or multiplied by 0x10) the Y nibbles need to be shifted down (or multiplied as a uint16 block by 0x1000 and taking the upper half of the result).

One block of 00 Z0 00 Z4 Y0 00 Y4 00 is actually like a 32bit integer and we can almost directly get this from Z0 00 Z4 00 and 00 Y0 00 Y4 by a single pmaddwd instruction with 0x10 and 0x1000:

00 Z0 00 Z4 Y0 00 Y4 00 = (00 Y0 00 Y4)* 0x1000 + (Z0 00 Z4 00) * 0x10

And these nibbles happen to be in the same bytes as X0, X4 and W0, W4 so only one pshufb is necessary to arrange the bytes accordingly, but unfortunately, if Y4>7 we have a negative integer which requires masking away some bits again (at least, we can re-use the same mask).

Overall, this function should do the job:

void transpose_4x8_nibbles(uint8_t const *src, uint8_t *dst) {
    __m128i const input = _mm_loadu_si128((__m128i const*)src);

    __m128i const shuff = _mm_shuffle_epi8(input, _mm_setr_epi8(0, 8, 4, 12, 1, 9, 5, 13, 2, 10, 6, 14, 3, 11, 7, 15));
    __m128i const mask = _mm_set1_epi32(0x0f0ff0f0);
    __m128i const XW = _mm_andnot_si128(mask, shuff);
    __m128i const YZ = _mm_and_si128(mask, shuff);
    __m128i const YZ_trans = _mm_madd_epi16(YZ, _mm_set1_epi32(0x00101000));
    __m128i const result = _mm_or_si128(XW, _mm_and_si128(mask, YZ_trans));

    _mm_storeu_si128((__m128i*)dst, result);
}

Godbolt demo (only SSSE3 is required because of pshufb): https://godbolt.org/z/c43oTz43r