
Cuda Tensor Cores: Matrix size only 16x16


I have this very minimal code to multiply two matrices with CUDA Tensor Cores:

#include <mma.h>
using namespace nvcuda;

constexpr int M = 16;
constexpr int N = 16;
constexpr int K = 16;

/*
 *  Matrix A = M x K, B = K x N => OUT = M x N
 */
__global__ void wmma_matrix_mult(half *a, half *b, float *out) {

   // Declare the fragments
   wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::col_major> a_frag;
   wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
   wmma::fragment<wmma::accumulator, M, N, K, float> c_frag;

   // Initialize the output to zero
   wmma::fill_fragment(c_frag, 0.0f);

   // Load the inputs (A is col-major with leading dimension M; B is row-major with leading dimension N)
   wmma::load_matrix_sync(a_frag, a, M);
   wmma::load_matrix_sync(b_frag, b, N);

   // Perform the matrix multiplication
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

   // Store the output
   wmma::store_matrix_sync(out, c_frag, N, wmma::mem_row_major);
}
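
A minimal host-side launch for this kernel could look like the following sketch (not part of the original question; it assumes the definitions above, leaves the inputs uninitialized, and omits error checking). A single warp of 32 threads cooperatively performs the whole WMMA operation:

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    half *a, *b;
    float *out;
    cudaMalloc(&a,   M * K * sizeof(half));
    cudaMalloc(&b,   K * N * sizeof(half));
    cudaMalloc(&out, M * N * sizeof(float));

    // ... fill a and b with input data ...

    // One block of 32 threads = exactly one warp, which owns the whole tile
    wmma_matrix_mult<<<1, 32>>>(a, b, out);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```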

As soon as M, N or K is set to anything other than 16, compilation fails with

error: incomplete type is not allowed

error: no instance of function template "nvcuda::wmma::fill_fragment" matches the argument list
            argument types are: (<error-type>, float)

Does this mean that A and B always have to be 16x16 in size? I thought 4x4 or 8x8 would be allowed as well.

I compile like this:

nvcc -arch=sm_75 -c ./src/main.cu -o ./src/build/main.o

so the target architecture (sm_75, Turing) should support Tensor Cores.


Solution

  • I thought 4x4 or 8x8 would be allowed as well?

    Unfortunately not. Let's read some documentation.

    For half precision inputs with a single precision accumulator, as in your use case, only the following sizes are supported:

    Matrix A    Matrix B    Accumulator    Matrix Size (m-n-k)
    __half      __half      float          16 x 16 x 16
    __half      __half      float          32 x 8 x 16
    __half      __half      float          8 x 32 x 16
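
    So non-square tiles are possible, but only in the exact shapes listed above; the 4x4 figure you may have seen describes the underlying hardware operation, not a shape the WMMA API exposes. As a sketch (adapted from the question's kernel, not a tested reference), the 32x8x16 shape would look like this, with the leading dimensions adjusted for the rectangular operands:

    ```cuda
    #include <mma.h>
    using namespace nvcuda;

    // A is 32x16 (col-major, ldm = 32), B is 16x8 (row-major, ldm = 8),
    // OUT is 32x8 (row-major, ldm = 8).
    constexpr int WM = 32, WN = 8, WK = 16;

    __global__ void wmma_32x8x16(const half *a, const half *b, float *out) {
       wmma::fragment<wmma::matrix_a, WM, WN, WK, half, wmma::col_major> a_frag;
       wmma::fragment<wmma::matrix_b, WM, WN, WK, half, wmma::row_major> b_frag;
       wmma::fragment<wmma::accumulator, WM, WN, WK, float> c_frag;

       wmma::fill_fragment(c_frag, 0.0f);
       wmma::load_matrix_sync(a_frag, a, WM);  // col-major: stride = rows of A
       wmma::load_matrix_sync(b_frag, b, WN);  // row-major: stride = cols of B
       wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
       wmma::store_matrix_sync(out, c_frag, WN, wmma::mem_row_major);
    }
    ```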