Tags: assembly, x86, simd, sse, avx

int8 x uint8 matrix-vector product with column-major layout


I'm hoping to speed up this matrix-vector product using AVX1 or earlier instructions:

    // a is an array of N columns, each of length M
    // b has length N
    // c has length M
    //
    // M and N are compile-time constants with M % 32 == N % 32 == 0
    // all memory is nicely aligned and unaliased
    
    void mat_vec_prod(const int8_t** a, const uint8_t* b, int16_t* c) {
        for(int i = 0; i < M; ++i) {
            c[i] = 0;
            for(int j = 0; j < N; ++j)
                c[i] += int16_t(a[j][i]) * int16_t(b[j]);
        }
    }

(I'm aware that swapping the loops is worth considering)
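For reference, a minimal sketch of the loop-swapped variant (same assumptions as the snippet above): it streams each column contiguously and accumulates into c, which fits the column-major layout better:

    // Loop-swapped scalar version: the inner loop walks one column of `a`
    // contiguously and accumulates into all of `c`.
    void mat_vec_prod_swapped(const int8_t** a, const uint8_t* b, int16_t* c) {
        for (int i = 0; i < M; ++i)
            c[i] = 0;
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < M; ++i)
                c[i] += int16_t(a[j][i]) * int16_t(b[j]);
    }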

The intrinsics _mm_maddubs_epi16 and _mm_maddubs_pi16 could help with the uint8 x int8 dot product, but in my case the matrix has an awkward layout: it's an array of pointers to columns (instead of rows).

One possibility is to load 8x8 patches of a, then transpose them and multiply by segments of b. (I found this thread on 8x8 byte matrix transpose.) However, that would have to use _mm_maddubs_pi16, which has only half the throughput of _mm_maddubs_epi16.

My question is: is it worth trying to load and transpose 16x16 patches instead, or will I run out of xmm registers? What should my strategy be here?


Solution

  • I'd go with the interleaving approach suggested by chtz.

    Read 32 or 64 bytes (the latter being a full cache line) from two rows, then interleave.

    At least 32 bytes, since the width of each row is a multiple of 32; preferably 64 bytes, since that is a full cache line and the corresponding int16 accumulators occupy 8 of the 16 xmm registers.

    Also, I would guess that processing the input in blocks of (8, 16, or 32 rows) by (32 or 64 columns) is better than processing all the rows at once: the more rows you process per block, the less often you need to spill the accumulators to memory, but the more rows you touch in non-linear order, the higher the probability of evicting soon-to-be-needed lines from the cache. 4 rows should definitely be on the safe side.
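    As a rough illustration (my sketch, not from the original answer), the blocking could be structured like this, with ROWS_PER_BLOCK and COLS_PER_BLOCK as hypothetical tuning constants:

    // Hypothetical blocking skeleton: sweep a block of rows across all
    // column blocks, keeping the int16 accumulators for one column block
    // in xmm registers while that block is being processed.
    constexpr int ROWS_PER_BLOCK = 8;   // tuning parameter; 4+ is safe
    constexpr int COLS_PER_BLOCK = 64;  // 64 bytes = one cache line per row

    for (int j0 = 0; j0 < N; j0 += ROWS_PER_BLOCK) {
        for (int i0 = 0; i0 < M; i0 += COLS_PER_BLOCK) {
            // reload partial sums for c[i0 .. i0+63] (zero on first row block)
            // for each row pair (j, j+1) in [j0, j0 + ROWS_PER_BLOCK):
            //   load 64 bytes from a[j] + i0 and a[j+1] + i0, interleave,
            //   _mm_maddubs_epi16 with the (b[j], b[j+1]) broadcast, accumulate
            // spill the partial sums back to c[i0 .. i0+63]
        }
    }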

    Interleaving b is quite naturally done by

    auto b0to7 = _mm_unpacklo_epi16(b, b);      // duplicated 16-bit pairs (b0,b1)..(b6,b7)
    auto b8tof = _mm_unpackhi_epi16(b, b);      // duplicated 16-bit pairs (b8,b9)..(b14,b15)
    auto b01 = _mm_shuffle_epi32(b0to7, 0x00);  // pair (b0,b1) repeated 8 times
    auto b23 = _mm_shuffle_epi32(b0to7, 0x55);  // pair (b2,b3) repeated 8 times
    ...
    auto bef = _mm_shuffle_epi32(b8tof, 0xff);  // pair (b14,b15) repeated 8 times
    

    Another possibility, splitting the input into even and odd byte sequences, needs 4 arithmetic instructions per 16 input bytes, or 8 instructions per 32 bytes:

    // common terms, hoisted per row j
    auto b_even = _mm_set1_epi16(b[j] & 0x00ff);  // bytes (b[j], 0) repeated
    auto b_odd  = _mm_set1_epi16(b[j] * 256);     // bytes (0, b[j]) repeated
    // per 16 input bytes in `a`; the unsigned operand of
    // _mm_maddubs_epi16 must come first, so pass the b vector there
    auto mul_even = _mm_maddubs_epi16(b_even, row_j);  // b[j] * even bytes of row_j
    auto mul_odd  = _mm_maddubs_epi16(b_odd, row_j);   // b[j] * odd bytes of row_j
    sum_even = _mm_add_epi16(sum_even, mul_even);
    sum_odd  = _mm_add_epi16(sum_odd, mul_odd);
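
    Note that sum_even then holds the results for even output positions and sum_odd those for odd positions, so they have to be interleaved back into output order before storing; a sketch (not spelled out in the original answer):

    // merge the even/odd accumulators back into output order
    auto c_lo = _mm_unpacklo_epi16(sum_even, sum_odd);  // c[0] .. c[7]
    auto c_hi = _mm_unpackhi_epi16(sum_even, sum_odd);  // c[8] .. c[15]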
    

    The even/odd variant is clearly not as tight as

    auto prod_lo = _mm_unpacklo_epi8(row_j, row_jplus1);  // interleave two rows bytewise
    auto prod_hi = _mm_unpackhi_epi8(row_j, row_jplus1);
    // the unsigned operand (the broadcast b pair) must come first
    prod_lo = _mm_maddubs_epi16(b01, prod_lo);
    prod_hi = _mm_maddubs_epi16(b01, prod_hi);
    sum_lo = _mm_add_epi16(sum_lo, prod_lo);
    sum_hi = _mm_add_epi16(sum_hi, prod_hi);
    

    but the shuffles can only execute on port 5, whereas two multiply-adds can start every cycle, so the two variants are probably quite close in performance.
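
    Putting the pieces together, a minimal untested sketch of the interleaved variant (mat_vec_prod_sse is my illustrative name, and M and N are passed as parameters here for self-containment; the blocking and hoisted b broadcasts from above are omitted for brevity):

    #include <tmmintrin.h>  // SSSE3: _mm_maddubs_epi16
    #include <cstdint>

    // Processes 16 outputs per iteration and rows in pairs.
    // Assumes M % 16 == 0, N % 2 == 0, 16-byte-aligned columns, and that
    // the accumulated sums stay within int16 range.
    void mat_vec_prod_sse(const int8_t** a, const uint8_t* b, int16_t* c,
                          int M, int N) {
        for (int i = 0; i < M; i += 16) {
            __m128i sum_lo = _mm_setzero_si128();  // c[i+0] .. c[i+7]
            __m128i sum_hi = _mm_setzero_si128();  // c[i+8] .. c[i+15]
            for (int j = 0; j < N; j += 2) {
                // broadcast the byte pair (b[j], b[j+1]) to all lanes
                __m128i bj = _mm_set1_epi16(int16_t(b[j] | (b[j + 1] << 8)));
                __m128i row_j  = _mm_load_si128((const __m128i*)(a[j] + i));
                __m128i row_j1 = _mm_load_si128((const __m128i*)(a[j + 1] + i));
                // interleave so each 16-bit lane pairs a[j][k] with a[j+1][k]
                __m128i lo = _mm_unpacklo_epi8(row_j, row_j1);
                __m128i hi = _mm_unpackhi_epi8(row_j, row_j1);
                // unsigned operand (b) first, signed matrix bytes second
                sum_lo = _mm_add_epi16(sum_lo, _mm_maddubs_epi16(bj, lo));
                sum_hi = _mm_add_epi16(sum_hi, _mm_maddubs_epi16(bj, hi));
            }
            _mm_store_si128((__m128i*)(c + i), sum_lo);
            _mm_store_si128((__m128i*)(c + i + 8), sum_hi);
        }
    }

    In a tuned version the b broadcasts would be hoisted out of the i loop as shown earlier, and the loads widened to 32 or 64 bytes per row.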