
Accumulating Two Tensor Core wmma::accumulator Fragments


Let's say that I have two instances of wmma::fragment<wmma::accumulator, 16, 16, 16, half> a, b; (namely a and b). How would I go about conducting an element-wise addition of a and b and storing the result back into a?


Solution

  • wmma fragments are stored in the registers of the threads of a warp, so element-wise operations can be performed directly, as long as each thread operates only on its own elements.

    Researchers at the Tokyo Institute of Technology have developed a C++ library, wmma_extension, which (among other features, such as recovering FP32 accuracy from TF32 Tensor Core operations) makes arithmetic operations on wmma fragments easy.

    The library can be found here: https://github.com/wmmae/wmma_extension

    Performing arithmetic operations as a simple one-liner (plus the include) is shown here: https://github.com/wmmae/wmma_extension/blob/main/docs/ops.md
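    Without the library, the same element-wise addition can be sketched in plain CUDA by looping over each thread's fragment registers. Because the register-to-matrix mapping is identical for two fragments of the same type, an element-wise operation is safe even though that mapping is unspecified. The kernel name and fill values below are illustrative, not from the question:

    ```cuda
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Illustrative kernel: adds accumulator fragment b into a element-wise.
    __global__ void add_accumulators()
    {
        wmma::fragment<wmma::accumulator, 16, 16, 16, half> a, b;

        // Example values; in practice these fragments would come from
        // wmma::mma_sync or wmma::load_matrix_sync.
        wmma::fill_fragment(a, __float2half(1.0f));
        wmma::fill_fragment(b, __float2half(2.0f));

        // Each thread adds the elements it holds in its own registers.
        // The per-thread layout is unspecified but identical for a and b,
        // so an element-wise add per thread is a valid matrix-wise add.
        for (int i = 0; i < a.num_elements; i++)
            a.x[i] = __hadd(a.x[i], b.x[i]);
    }
    ```

    This requires compute capability 7.0 or higher (e.g. `nvcc -arch=sm_70`), the same as any other use of the wmma API.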

    The researchers released two related papers in 2023: