
Accumulating Two Tensor Core wmma::accumulator Fragments


Let's say that I have two instances of wmma::fragment<wmma::accumulator, 16, 16, 16, half> a, b; (namely a and b). How would I go about conducting an element-wise addition of a and b and storing the result back into a?


Solution

  • wmma fragments are stored in the registers of the threads of a warp, so element-wise operations can be performed directly, as long as each thread operates only on its own elements.

    Researchers at the Tokyo Institute of Technology have developed a C++ library, wmma_extension, which (among other features, such as recovering FP32 accuracy from TF32 Tensor Core operations) makes arithmetic operations on wmma fragments easy.

    The library can be found here: https://github.com/wmmae/wmma_extension

    Performing arithmetic operations as a simple one-liner (plus the include) is shown here: https://github.com/wmmae/wmma_extension/blob/main/docs/ops.md
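    Without the library, the same element-wise addition can be sketched in plain CUDA by looping over each thread's fragment registers. Because the register-to-matrix mapping is identical for two fragments of the same type, an element-wise operation is safe even though that mapping is unspecified. The kernel name and fill values below are illustrative, not from the question:

    ```cuda
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Illustrative kernel: adds accumulator fragment b into a element-wise.
    __global__ void add_accumulators()
    {
        wmma::fragment<wmma::accumulator, 16, 16, 16, half> a, b;

        // Example values; in practice these fragments would come from
        // wmma::mma_sync or wmma::load_matrix_sync.
        wmma::fill_fragment(a, __float2half(1.0f));
        wmma::fill_fragment(b, __float2half(2.0f));

        // Each thread adds the elements it holds in its own registers.
        // The per-thread layout is unspecified but identical for a and b,
        // so an element-wise add per thread is a valid matrix-wise add.
        for (int i = 0; i < a.num_elements; i++)
            a.x[i] = __hadd(a.x[i], b.x[i]);
    }
    ```

    This requires compute capability 7.0 or higher (e.g. `nvcc -arch=sm_70`), the same as any other use of the wmma API.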

    The researchers released two related papers in 2023: