Tags: assembly, matrix, intel, simd, avx512

How to transpose a 16x16 matrix using SIMD instructions?


I'm currently writing some code targeting Intel's forthcoming AVX-512 SIMD instructions, which support 512-bit operations.

Now assuming there's a matrix represented by 16 SIMD registers, each holding 16 32-bit integers (one register per row), how can I transpose the matrix purely with SIMD instructions?

There are already solutions for transposing 4x4 and 8x8 matrices with SSE and AVX2, respectively. But I couldn't figure out how to extend this to 16x16 with AVX-512.

Any ideas?


Solution

  • For two-operand SIMD instructions you can show that the number of operations necessary to transpose an nxn matrix is n*log_2(n), whereas with scalar operations it's O(n^2). In fact, I'll show below that the number of read and write operations using scalar registers is 2*n*(n-1). The table below shows the number of operations needed to transpose 4x4, 8x8, 16x16, and 32x32 matrices using SSE, AVX, AVX512, and AVX1024, compared to scalar operations:

    n            4(SSE)          8(AVX)    16(AVX512)    32(AVX1024)  
    SIMD ops          8              24           64            160
    SIMD +r/w ops    16              40           96            224     
    Scalar r/w ops   24             112          480           1984
    

    where SIMD +r/w ops additionally counts the n reads and n writes needed to load and store the rows (n*log_2(n) + 2*n in total).
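
    If you want to reproduce the table, here is a minimal check of my own (plain C, nothing SIMD-specific; the loop just walks the four values of n from the table):

        #include <stdio.h>

        int main(void) {
            // n = 4 (SSE), 8 (AVX), 16 (AVX512), 32 (AVX1024)
            for (int n = 4, log2n = 2; n <= 32; n *= 2, log2n++) {
                int simd_ops  = n * log2n;          // n*log_2(n) two-operand shuffles/unpacks
                int simd_rw   = n * log2n + 2 * n;  // plus n loads and n stores
                int scalar_rw = 2 * n * (n - 1);    // 4 memory ops per swap, n*(n-1)/2 swaps
                printf("n = %2d: SIMD ops %4d, SIMD +r/w %4d, scalar r/w %5d\n",
                       n, simd_ops, simd_rw, scalar_rw);
            }
            return 0;
        }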

    The reason the SIMD transpose can be done in n*log_2(n) operations is that the algorithm is:

    permute 32-bit lanes   (n operations)
    permute 64-bit lanes   (n operations)
    ...
    permute (simd_width/2)-bit lanes   (n operations)
    

    For example, for 4x4 there are 4 rows, so you have to permute 32-bit lanes 4 times and then 64-bit lanes 4 times, for 2*4 = 8 operations. For 16x16 you have to permute 32-bit lanes, 64-bit lanes, 128-bit lanes, and finally 256-bit lanes, 16 times each, for 4*16 = 64 operations.
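
    To make the stage structure concrete, here is a sketch of the 4x4 case with SSE2 unpack intrinsics: two stages of four two-operand operations each, i.e. 4*log_2(4) = 8 operations. This is only an illustration (the function name tran4x4_epi32 is mine, and it is not the _mm_shuffle_ps variant discussed further down):

        #include <emmintrin.h>

        // Transpose a 4x4 matrix of 32-bit integers held in four SSE registers,
        // with element k of row i numbered 4*i + k.
        // Stage 1: interleave 32-bit elements. Stage 2: interleave 64-bit pairs.
        static inline void tran4x4_epi32(__m128i *r0, __m128i *r1, __m128i *r2, __m128i *r3) {
            __m128i t0 = _mm_unpacklo_epi32(*r0, *r1); //  0  4  1  5
            __m128i t1 = _mm_unpackhi_epi32(*r0, *r1); //  2  6  3  7
            __m128i t2 = _mm_unpacklo_epi32(*r2, *r3); //  8 12  9 13
            __m128i t3 = _mm_unpackhi_epi32(*r2, *r3); // 10 14 11 15

            *r0 = _mm_unpacklo_epi64(t0, t2);          //  0  4  8 12
            *r1 = _mm_unpackhi_epi64(t0, t2);          //  1  5  9 13
            *r2 = _mm_unpacklo_epi64(t1, t3);          //  2  6 10 14
            *r3 = _mm_unpackhi_epi64(t1, t3);          //  3  7 11 15
        }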

    I already showed that 8x8 can be done with 24 operations with AVX. So the question is how to do this for 16x16 using AVX-512 in 64 operations. The general algorithm is:

    interleave 32-bit lanes using 
        8x _mm512_unpacklo_epi32
        8x _mm512_unpackhi_epi32
    interleave 64-bit lanes using
        8x _mm512_unpacklo_epi64 
        8x _mm512_unpackhi_epi64 
    permute 128-bit lanes using
       16x _mm512_shuffle_i32x4
    permute 256-bit lanes again using
       16x _mm512_shuffle_i32x4
    

    Here is untested code that does this:

        //given __m512i r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, ra, rb, rc, rd, re, rf;
        __m512i t0, t1, t2, t3, t4, t5, t6, t7, t8, t9, ta, tb, tc, td, te, tf;
    
        t0 = _mm512_unpacklo_epi32(r0,r1); //   0  16   1  17   4  20   5  21   8  24   9  25  12  28  13  29 
        t1 = _mm512_unpackhi_epi32(r0,r1); //   2  18   3  19   6  22   7  23  10  26  11  27  14  30  15  31
        t2 = _mm512_unpacklo_epi32(r2,r3); //  32  48  33  49 ...
        t3 = _mm512_unpackhi_epi32(r2,r3); //  34  50  35  51 ...
        t4 = _mm512_unpacklo_epi32(r4,r5); //  64  80  65  81 ...  
        t5 = _mm512_unpackhi_epi32(r4,r5); //  66  82  67  83 ...
        t6 = _mm512_unpacklo_epi32(r6,r7); //  96 112  97 113 ...
        t7 = _mm512_unpackhi_epi32(r6,r7); //  98 114  99 115 ...
        t8 = _mm512_unpacklo_epi32(r8,r9); // 128 ...
        t9 = _mm512_unpackhi_epi32(r8,r9); // 130 ...
        ta = _mm512_unpacklo_epi32(ra,rb); // 160 ...
        tb = _mm512_unpackhi_epi32(ra,rb); // 162 ...
        tc = _mm512_unpacklo_epi32(rc,rd); // 192 ...
        td = _mm512_unpackhi_epi32(rc,rd); // 194 ...
        te = _mm512_unpacklo_epi32(re,rf); // 224 ...
        tf = _mm512_unpackhi_epi32(re,rf); // 226 ...
    
        r0 = _mm512_unpacklo_epi64(t0,t2); //   0  16  32  48 ...
        r1 = _mm512_unpackhi_epi64(t0,t2); //   1  17  33  49 ...
        r2 = _mm512_unpacklo_epi64(t1,t3); //   2  18  34  50 ...
        r3 = _mm512_unpackhi_epi64(t1,t3); //   3  19  35  51 ...
        r4 = _mm512_unpacklo_epi64(t4,t6); //  64  80  96 112 ...  
        r5 = _mm512_unpackhi_epi64(t4,t6); //  65  81  97 113 ...
        r6 = _mm512_unpacklo_epi64(t5,t7); //  66  82  98 114 ...
        r7 = _mm512_unpackhi_epi64(t5,t7); //  67  83  99 115 ...
        r8 = _mm512_unpacklo_epi64(t8,ta); // 128 144 160 176 ...  
        r9 = _mm512_unpackhi_epi64(t8,ta); // 129 145 161 177 ...
        ra = _mm512_unpacklo_epi64(t9,tb); // 130 146 162 178 ...
        rb = _mm512_unpackhi_epi64(t9,tb); // 131 147 163 179 ...
        rc = _mm512_unpacklo_epi64(tc,te); // 192 208 224 240 ...
        rd = _mm512_unpackhi_epi64(tc,te); // 193 209 225 241 ...
        re = _mm512_unpacklo_epi64(td,tf); // 194 210 226 242 ...
        rf = _mm512_unpackhi_epi64(td,tf); // 195 211 227 243 ...
    
        t0 = _mm512_shuffle_i32x4(r0, r4, 0x88); //   0  16  32  48   8  24  40  56  64  80  96  112 ...
        t1 = _mm512_shuffle_i32x4(r1, r5, 0x88); //   1  17  33  49 ...
        t2 = _mm512_shuffle_i32x4(r2, r6, 0x88); //   2  18  34  50 ...
        t3 = _mm512_shuffle_i32x4(r3, r7, 0x88); //   3  19  35  51 ...
        t4 = _mm512_shuffle_i32x4(r0, r4, 0xdd); //   4  20  36  52 ...
        t5 = _mm512_shuffle_i32x4(r1, r5, 0xdd); //   5  21  37  53 ...
        t6 = _mm512_shuffle_i32x4(r2, r6, 0xdd); //   6  22  38  54 ...
        t7 = _mm512_shuffle_i32x4(r3, r7, 0xdd); //   7  23  39  55 ...
        t8 = _mm512_shuffle_i32x4(r8, rc, 0x88); // 128 144 160 176 ...
        t9 = _mm512_shuffle_i32x4(r9, rd, 0x88); // 129 145 161 177 ...
        ta = _mm512_shuffle_i32x4(ra, re, 0x88); // 130 146 162 178 ...
        tb = _mm512_shuffle_i32x4(rb, rf, 0x88); // 131 147 163 179 ...
        tc = _mm512_shuffle_i32x4(r8, rc, 0xdd); // 132 148 164 180 ...
        td = _mm512_shuffle_i32x4(r9, rd, 0xdd); // 133 149 165 181 ...
        te = _mm512_shuffle_i32x4(ra, re, 0xdd); // 134 150 166 182 ...
        tf = _mm512_shuffle_i32x4(rb, rf, 0xdd); // 135 151 167 183 ...
    
        r0 = _mm512_shuffle_i32x4(t0, t8, 0x88); //   0  16  32  48  64  80  96 112 ... 240
        r1 = _mm512_shuffle_i32x4(t1, t9, 0x88); //   1  17  33  49  65  81  97 113 ... 241
        r2 = _mm512_shuffle_i32x4(t2, ta, 0x88); //   2  18  34  50  66  82  98 114 ... 242
        r3 = _mm512_shuffle_i32x4(t3, tb, 0x88); //   3  19  35  51  67  83  99 115 ... 243
        r4 = _mm512_shuffle_i32x4(t4, tc, 0x88); //   4 ...
        r5 = _mm512_shuffle_i32x4(t5, td, 0x88); //   5 ...
        r6 = _mm512_shuffle_i32x4(t6, te, 0x88); //   6 ...
        r7 = _mm512_shuffle_i32x4(t7, tf, 0x88); //   7 ...
        r8 = _mm512_shuffle_i32x4(t0, t8, 0xdd); //   8 ...
        r9 = _mm512_shuffle_i32x4(t1, t9, 0xdd); //   9 ...
        ra = _mm512_shuffle_i32x4(t2, ta, 0xdd); //  10 ...
        rb = _mm512_shuffle_i32x4(t3, tb, 0xdd); //  11 ...
        rc = _mm512_shuffle_i32x4(t4, tc, 0xdd); //  12 ...
        rd = _mm512_shuffle_i32x4(t5, td, 0xdd); //  13 ...
        re = _mm512_shuffle_i32x4(t6, te, 0xdd); //  14 ...
        rf = _mm512_shuffle_i32x4(t7, tf, 0xdd); //  15  31  47  63  79  95 111 127 ... 255
    

    I got the idea for using _mm512_shuffle_i32x4 by looking at transposing a 4x4 matrix using _mm_shuffle_ps (which is what MSVC uses in _MM_TRANSPOSE4_PS, but not GCC and ICC).

    __m128 tmp0, tmp1, tmp2, tmp3;
    tmp0 = _mm_shuffle_ps(row0, row1, 0x88); // 0 2 4 6
    tmp1 = _mm_shuffle_ps(row0, row1, 0xdd); // 1 3 5 7
    tmp2 = _mm_shuffle_ps(row2, row3, 0x88); // 8 a c e
    tmp3 = _mm_shuffle_ps(row2, row3, 0xdd); // 9 b d f
    
    row0 = _mm_shuffle_ps(tmp0, tmp2, 0x88); // 0 4 8 c 
    row1 = _mm_shuffle_ps(tmp1, tmp3, 0x88); // 1 5 9 d
    row2 = _mm_shuffle_ps(tmp0, tmp2, 0xdd); // 2 6 a e 
    row3 = _mm_shuffle_ps(tmp1, tmp3, 0xdd); // 3 7 b f
    

    The same idea applies to _mm512_shuffle_i32x4, but now the lanes are 128-bit instead of 32-bit and there are 16 rows instead of 4.
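
    If the imm8 values 0x88 and 0xdd look opaque: 0x88 selects the even 128-bit lanes of both sources and 0xdd the odd ones, which is exactly the 128-bit analogue of what _mm_shuffle_ps does above with 32-bit elements. Here is a small test program of my own (requires AVX-512F hardware; compile with e.g. -mavx512f) that makes this visible:

        #include <immintrin.h>
        #include <stdint.h>
        #include <stdio.h>

        int main(void) {
            int32_t av[16], bv[16], lo[16], hi[16];
            for (int i = 0; i < 16; i++) { av[i] = i; bv[i] = 100 + i; }

            __m512i a = _mm512_loadu_si512(av);
            __m512i b = _mm512_loadu_si512(bv);
            // 0x88 = lanes {0,2} of a, then {0,2} of b; 0xdd = lanes {1,3} of a, then {1,3} of b
            _mm512_storeu_si512(lo, _mm512_shuffle_i32x4(a, b, 0x88));
            _mm512_storeu_si512(hi, _mm512_shuffle_i32x4(a, b, 0xdd));

            for (int i = 0; i < 16; i++) printf("%4d", lo[i]);
            printf("\n"); //   0   1   2   3   8   9  10  11 100 101 102 103 108 109 110 111
            for (int i = 0; i < 16; i++) printf("%4d", hi[i]);
            printf("\n"); //   4   5   6   7  12  13  14  15 104 105 106 107 112 113 114 115
            return 0;
        }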

    Finally, to compare with scalar operations, I modified Example 9.5a from Agner Fog's Optimizing C++ manual:

    #define SIZE 16
    void transpose(int a[SIZE][SIZE]) { // function to transpose matrix
        // define a macro to swap two array elements:
        #define swapd(x,y) {temp=x; x=y; y=temp;}
        int r, c; int temp;
        for (r = 1; r < SIZE; r++) {
            for (c = 0; c < r; c++) {
                swapd(a[r][c], a[c][r]);
            }
        }
    }
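
    As a quick sanity check of the scalar version (and of the element numbering used in the SIMD comments above), a small harness like this can be used; the fill pattern a[r][c] = 16*r + c and the main function are my own additions:

        #include <stdio.h>

        void transpose(int a[16][16]); // the function above

        int main(void) {
            int a[16][16];
            for (int r = 0; r < 16; r++)        // element k of row r is numbered 16*r + k
                for (int c = 0; c < 16; c++)
                    a[r][c] = 16*r + c;

            transpose(a);

            int ok = 1;                         // after transposing, a[r][c] must be 16*c + r
            for (int r = 0; r < 16; r++)
                for (int c = 0; c < 16; c++)
                    ok &= (a[r][c] == 16*c + r);
            printf("%s\n", ok ? "transpose ok" : "transpose wrong");
            return 0;
        }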
    

    This does n*(n-1)/2 swaps (because the diagonal does not need to be swapped). The swaps in the assembly generated for 16x16 look like this:

    mov     r8d, DWORD PTR [rax+68]
    mov     r9d, DWORD PTR [rdx+68]
    mov     DWORD PTR [rax+68], r9d
    mov     DWORD PTR [rdx+68], r8d
    

    Each swap compiles to two reads and two writes (four memory operations), so the number of read/write operations using the scalar registers is 4 * n*(n-1)/2 = 2*n*(n-1); for n = 16 that is 480, matching the table above.