Tags: conv-neural-network, linear-algebra, convolution, cudnn

NCHW input matrix to Dm conversion logic for convolution in cuDNN


I have been trying to understand the convolution lowering operation shown in the cuDNN paper. I was able to understand most of it by reading through and mapping various parameters to the image below. However, I am unable to understand how the original input data (NCHW) was converted into the Dm matrix shown in red.

The ordering of the elements of the Dm matrix does not make sense. Can someone please explain this?

[figure: convolution lowering diagram from the cuDNN paper, with the Dm matrix highlighted in red]


Solution

  • Each column of Dm corresponds to a tile of the original image. Two examples are shown below:

    [figures: the upper-left and upper-right tiles of the input image, alongside the columns of Dm they map to]

    There is no simple mathematical description of how to extract these tiles (the authors call the lowering "non-trivial"), but section 3.1 of the paper offers some general comments. A minimal sketch of the extraction is given after the notes below.

    A couple of notes:

    1. The exact layout of data in Dm and Fm is flexible: you can permute the rows of Dm as long as you permute the columns of Fm to match, and vice versa (the second sketch below demonstrates this).
    2. cuDNN does not actually construct Dm in full; rather, it lazily generates columns of Dm as they are needed (see section 3.1 of the paper).
    3. Convolution or cross-correlation? The classical definition of a convolution requires that the filters be flipped (along both axes) before they are applied to the image. Modern machine-learning frameworks tend not to do this, and mathematical pedants call the unflipped operation a cross-correlation rather than a convolution. From a machine-learning perspective it doesn't matter which one you use, but filter-flipping gives convolution nice algebraic properties (e.g. commutativity) and matches the definition of convolution used in mathematics (side note: "convolve" means to fold or roll). In the cuDNN paper the filters are flipped; the last sketch below shows that flipping turns a cross-correlation into a true convolution.
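
To make the tile extraction concrete, here is a minimal NumPy sketch of the lowering for the stride-1, no-padding case. This is illustrative only, not cuDNN's actual implementation: the function name `im2col_nchw` and the ordering of rows within each column are my own choices, and (per note 3) the paper's figure additionally flips the filters.

```python
import numpy as np

def im2col_nchw(d, r, s):
    """Lower an NCHW tensor into a Dm-style matrix (stride 1, no padding).

    Each column is one flattened C x R x S tile of the input, so a
    K x (C*R*S) filter matrix Fm times this matrix gives the output Om.
    """
    n, c, h, w = d.shape
    p, q = h - r + 1, w - s + 1           # output spatial size
    dm = np.empty((c * r * s, n * p * q))
    col = 0
    for img in range(n):                  # columns are generated one at a
        for y in range(p):                # time, which is what makes lazy
            for x in range(q):            # materialization (note 2) possible
                dm[:, col] = d[img, :, y:y + r, x:x + s].reshape(-1)
                col += 1
    return dm

# toy shapes matching the paper's figure: N=1, C=3, H=W=3, K=2, R=S=2
d = np.arange(27, dtype=float).reshape(1, 3, 3, 3)
f = np.random.randn(2, 3, 2, 2)           # K filters, NOT flipped here
fm = f.reshape(2, -1)                     # Fm: K x (C*R*S)
om = fm @ im2col_nchw(d, 2, 2)            # Om: K x (P*Q), one output pixel per column
```

Because each column depends only on its own tile, the inner loop can emit columns one at a time, which is exactly what makes the lazy materialization of note 2 possible.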
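
A quick check of note 1, again just a sketch with made-up shapes: permuting the rows of Dm with a permutation matrix P, and the columns of Fm with its transpose, leaves the product Om = Fm Dm unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
fm = rng.standard_normal((2, 12))   # Fm: K x (C*R*S)
dm = rng.standard_normal((12, 4))   # Dm: (C*R*S) x (P*Q)

perm = rng.permutation(12)          # arbitrary reordering of tile elements
p = np.eye(12)[perm]                # permutation matrix P

# P is orthogonal, so Fm P^T P Dm == Fm Dm: permuting the rows of Dm
# and the columns of Fm consistently leaves Om unchanged
assert np.allclose(fm @ dm, (fm @ p.T) @ (p @ dm))
```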
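
Finally, an illustration of note 3 (assuming SciPy is available): flipping a filter along both axes turns a cross-correlation into a true convolution.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

img = np.random.randn(5, 5)
filt = np.random.randn(3, 3)

# convolution == cross-correlation with the filter flipped on both axes
assert np.allclose(
    convolve2d(img, filt, mode="valid"),
    correlate2d(img, filt[::-1, ::-1], mode="valid"),
)
```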