Transpose 4x4 int32 matrix using NEON

How can I efficiently transpose a matrix represented as four int32x4_t values? I cannot use vld4q_s32 and vst4q_s32.

Solution

Transposing is typically implemented in the recursive form:

   [A B]' == [A' C']
   [C D]     [B' D']

where each of A, B, C, D can be thought of as a square matrix (and A' == A for 1x1 matrices).
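The recursion can be sketched in plain scalar C++ for the 4x4 case. This is only an illustrative model, not NEON code: the type `Mat4` and the functions `transpose_blocks` and `check_transpose_blocks` are names made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical row-major 4x4 matrix type, used only for this illustration.
using Mat4 = std::array<std::array<std::int32_t, 4>, 4>;

// Scalar model of the recursive formula for a 4x4 matrix:
// first exchange the off-diagonal 2x2 blocks B and C,
// then transpose each of the four 2x2 blocks.
inline Mat4 transpose_blocks(Mat4 m) {
    // Top level: swap block B (rows 0-1, cols 2-3) with block C (rows 2-3, cols 0-1).
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 2; ++c)
            std::swap(m[r][c + 2], m[r + 2][c]);
    // Bottom level: transposing a 2x2 block swaps its two off-diagonal elements.
    for (int br = 0; br < 4; br += 2)
        for (int bc = 0; bc < 4; bc += 2)
            std::swap(m[br][bc + 1], m[br + 1][bc]);
    return m;
}

// Self-check: element (r, c) of the result must equal element (c, r) of the input.
inline void check_transpose_blocks(const Mat4& m) {
    const Mat4 t = transpose_blocks(m);
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            assert(t[r][c] == m[c][r]);
}
```

The bottom-level loop corresponds to what a 32-bit transpose instruction does on pairs of rows, and the top-level loop is the 64-bit block exchange discussed next.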

Using a transpose instruction (trn1/trn2) will unfortunately help only at the level that transposes the lowest 2x2 matrices in parallel. The ARM64 instruction set lacks an instruction to transpose the highest level, i.e. 64 bits at a time, which was done on arm-v7 as swp d2, d0.

There is, however, another, iterative formulation of the transpose: repeatedly zipping (or unzipping) the input.

In Octave/MATLAB notation, we define the zip operator as one that takes the first N/2 elements of an array a and interleaves them with the last N/2 elements of the same array.

  zip = @(a) [a(1:end/2);a(end/2+1:end)](:)';
  zip(0:15) %--> 0  8  1  9  2 10  3 11  4 12  5 13  6 14  7 15
  zip(ans)  %--> 0  4  8 12  1  5  9 13  2  6 10 14  3  7 11 15
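The same permutation can be checked in portable scalar C++ before turning to intrinsics. The following is a minimal sketch, assuming a row-major 4x4 matrix stored as 16 values; `Flat16`, `zip_model`, `transpose_ref` and `check_zip_twice` are illustrative names, not part of any library.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical flat, row-major 4x4 matrix, used only for this illustration.
using Flat16 = std::array<std::int32_t, 16>;

// Interleave the first half of `a` with its second half,
// mirroring the Octave `zip` defined above.
inline Flat16 zip_model(const Flat16& a) {
    Flat16 out{};
    for (std::size_t i = 0; i < 8; ++i) {
        out[2 * i]     = a[i];      // element from the first half
        out[2 * i + 1] = a[i + 8];  // element from the second half
    }
    return out;
}

// Straightforward reference transpose for comparison.
inline Flat16 transpose_ref(const Flat16& a) {
    Flat16 out{};
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            out[4 * c + r] = a[4 * r + c];
    return out;
}

// Two zip passes should give the same result as one transpose.
inline void check_zip_twice(const Flat16& a) {
    assert(zip_model(zip_model(a)) == transpose_ref(a));
}
```

Each zip pass rotates the 4-bit element index left by one bit, so two passes exchange the row bits with the column bits, which is exactly a 4x4 transpose.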

Thus the complete code using intrinsics could be something like:

#include <arm_neon.h>

// One zip pass: interleave rows 0 and 2, and rows 1 and 3 (AArch64 only).
inline int32x4x4_t zip(int32x4x4_t a) {
   return {{ vzip1q_s32(a.val[0], a.val[2]),
             vzip2q_s32(a.val[0], a.val[2]),
             vzip1q_s32(a.val[1], a.val[3]),
             vzip2q_s32(a.val[1], a.val[3]) }};
}

int32x4x4_t transpose(int32x4x4_t a) {
    return zip(zip(a));
}

On some architectures one can also use a table lookup (vtbl), if one has 8 extra registers to spare (4 for the indices and 4 for the output):

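uint8x16x4_t transpose_vtbl(uint8x16x4_t a) {
   // Byte indices that pick 32-bit column 0 from each row; columns 1..3 are offset by 4, 8, 12.
   static const uint8_t col0[16] = { 0, 1, 2, 3, 16, 17, 18, 19, 32, 33, 34, 35, 48, 49, 50, 51 };
   const uint8x16_t idx0 = vld1q_u8(col0);
   const uint8x16_t idx1 = vaddq_u8(idx0, vdupq_n_u8(4));
   const uint8x16_t idx2 = vaddq_u8(idx0, vdupq_n_u8(8));
   const uint8x16_t idx3 = vaddq_u8(idx0, vdupq_n_u8(12));
   return {{
      vqtbl4q_u8(a, idx0),
      vqtbl4q_u8(a, idx1),
      vqtbl4q_u8(a, idx2),
      vqtbl4q_u8(a, idx3) }};
}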

Here the int32x4x4_t input needs to be cast element-wise with vreinterpretq_u8_s32(input.val[i]), and the result cast back with vreinterpretq_s32_u8. idx0 holds the byte indices 0 1 2 3 16 17 18 19 32 33 34 35 48 49 50 51, idx1 == idx0 + 4, and so on.
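To make the index pattern concrete, here is a scalar C++ sketch that builds the four index vectors and emulates the byte lookup. It is a model of the idea rather than NEON code, and `make_index`, `lookup` and `transpose_lookup_model` are hypothetical names chosen for this example. It assumes all indices stay below 64, so the out-of-range behaviour of the real instruction is not modelled.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

using Bytes64 = std::array<std::uint8_t, 64>;  // four 16-byte rows, back to back
using Bytes16 = std::array<std::uint8_t, 16>;  // one 16-byte vector
using Flat16  = std::array<std::int32_t, 16>;  // row-major 4x4 matrix

// Index vector for output row `col`: for each input row, the 4 bytes of
// its 32-bit element in column `col`. make_index(0) is idx0 from the text.
inline Bytes16 make_index(unsigned col) {
    Bytes16 idx{};
    for (unsigned row = 0; row < 4; ++row)
        for (unsigned b = 0; b < 4; ++b)
            idx[4 * row + b] = static_cast<std::uint8_t>(16 * row + 4 * col + b);
    return idx;
}

// Scalar stand-in for one 64-byte table lookup: out[i] = table[idx[i]].
inline Bytes16 lookup(const Bytes64& table, const Bytes16& idx) {
    Bytes16 out{};
    for (unsigned i = 0; i < 16; ++i) {
        assert(idx[i] < 64);  // this model only handles in-range indices
        out[i] = table[idx[i]];
    }
    return out;
}

// Transpose by four lookups, one per output row.
inline Flat16 transpose_lookup_model(const Flat16& m) {
    Bytes64 bytes{};
    std::memcpy(bytes.data(), m.data(), bytes.size());
    Flat16 out{};
    for (unsigned col = 0; col < 4; ++col) {
        const Bytes16 row = lookup(bytes, make_index(col));
        std::memcpy(&out[4 * col], row.data(), row.size());
    }
    return out;
}
```

Because whole 4-byte groups are moved together, the model does not depend on the byte order of the host.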

On M1 or M2, VTBL seems to benefit from instruction-level parallelism, in contrast to e.g. a moderately recent Cortex-A75 (v8.2), where the vtbl takes 3N+1 clock cycles with no dual-issue capability.