Where does the third dimension (as in 4x4x4) of tensor cores come from?

As I understand, the Nvidia tensor cores multiplies two 4x4 matrices and adds the result to a third matrix. Multiplying two 4x4 matrices produces a 4x4 matrix, and adding two 4x4 matrices produces a 4x4 matrix. Still "Each Tensor Core provides a 4x4x4 matrix processing array".

There are 4x multiplication-accumulate operations that are needed for each row*col. I thought maybe the last x4 comes from intermediate result before the accumulation, but I don't think it quite fits with the description on Nvidias pages.

"The FP16 multiply results in a full precision result that is accumulated in FP32 operations with the other products in a given dot product for a 4x4x4 matrix multiply, as Figure 9 shows."

4x4x4 matrix multiply? I thought matrices was 2dimensions by definition.

Can someone please explain where the last x4 comes from?


  • 4x4x4 is just the notation for multiplication of one 4x4 matrix with another 4x4 matrix.

    If you were to multiply a 4x8 matrix with a 8x4 matrix, you would have 4x8x4. So if A is NxK and B is KxM, then it can be referred to as a NxKxM matrix multiply.

    I just briefly looked up and found this paper, where they use this exact notation (e.g. in Section 4.6 on page 36):