machine-learning, deep-learning, conv-neural-network, backpropagation

Mismatched kernel and delta kernel shapes in backpropagation of a CNN


BACKGROUND
I have followed the explanations from these two videos for backpropagation in a CNN.
https://www.youtube.com/watch?v=Pn7RK7tofPg&t=703
https://www.youtube.com/watch?v=Lakz2MoHy6o&t=1455s

As I understand it, the gradient for a single kernel will be the input to the current layer convolved with the delta backpropagated from the error of the next layer.

delta_channel = conv(Xm, ERRORn)
Knm' = Knm - learningRate * delta_channel

Where
n is filter index
m is channel/depth index of filter
K is kernel
Xm is the input to the filter at the corresponding depth
ERROR is the error backpropagated from the output of the filter

ISSUE
However, this is confusing to me because the shapes will not align.

Say the variables are of the following shapes:
K is 3x3x32
X is 128x128x3
ERROR is 128x128x32 (one for each of the outputs of the filters in K, zero padded)

Now that means, according to the backpropagation equation, that delta_channel will be 128x128.

It is not possible to subtract the delta from the 3x3 kernel as they are different shapes. Where is my misunderstanding?
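
A quick shape-only sketch of what I mean (rough numpy code; I am assuming conv() behaves like a "same"-padded convolution here, and the variable names are just for illustration):

    import numpy as np
    from scipy.signal import correlate2d

    X = np.random.randn(128, 128, 3)        # input to the layer
    K = np.random.randn(3, 3, 32)           # kernels as described above
    ERROR = np.random.randn(128, 128, 32)   # error from the next layer

    n, m = 0, 0
    # "same"-style convolution of one input channel with one error map
    delta_channel = correlate2d(X[:, :, m], ERROR[:, :, n], mode='same')
    print(delta_channel.shape)  # (128, 128) -- cannot be subtracted from the 3x3 kernel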


Solution

  • EDIT: The real reason is that one must use "full" and "valid" convolutions during backpropagation. I mistakenly believed "full" was the same as "same".

    There are three types of convolution important to remember in this scenario (a quick shape check follows the list below).

    I: input size
    K: kernel size

    1. Valid Convolution
      • The filter is slid within the bounds of the input. No zero padding.
      • The output is (I-K+1) x (I-K+1) in dimensions.
    2. Same Convolution
      • The filter is slid such that output size is equal to input size.
      • The output is I x I in dimensions.
    3. Full Convolution
      • The filter is slid as long as any part of it overlaps with the input.
      • The output is (I+K-1) x (I+K-1) in dimensions.
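
    A minimal shape check of the three modes, using scipy.signal.correlate2d (the sizes here are arbitrary, just for illustration):

        import numpy as np
        from scipy.signal import correlate2d

        I, K = 7, 3
        image = np.random.randn(I, I)
        kernel = np.random.randn(K, K)

        print(correlate2d(image, kernel, mode='valid').shape)  # (5, 5) -> (I-K+1) x (I-K+1)
        print(correlate2d(image, kernel, mode='same').shape)   # (7, 7) -> I x I
        print(correlate2d(image, kernel, mode='full').shape)   # (9, 9) -> (I+K-1) x (I+K-1)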

    Now the important part is that a full convolution be used in the backpropagation for the input, and a valid convolution for the kernels, as this lines up the error with the parts of the input/filters that contributed to it.

    During the forward pass, either same or valid convolutions can be used.

    Let's look at example dimensions to see why; for brevity we only look at one side of a square filter.

    For these examples we assume a valid convolution is used during the forward pass (each derivation below is followed by a small numpy shape check).

    This means output error is shape (I-K+1).

    Valid Convolution For Kernels: corr(input, error, valid)

    • out_dim = I - (I - K + 1) + 1
    • out_dim = I - I + K - 1 + 1
    • out_dim = K
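
    As a quick sanity check of those dimensions (a sketch only, assuming a single channel and a valid forward pass; names are illustrative):

        import numpy as np
        from scipy.signal import correlate2d

        I, K = 8, 3
        x = np.random.randn(I, I)                      # input to the layer (one channel)
        error = np.random.randn(I - K + 1, I - K + 1)  # error from a valid forward pass

        dK = correlate2d(x, error, mode='valid')  # valid correlation of input with error
        print(dK.shape)  # (3, 3) -- same shape as the kernel, so K' = K - lr * dK works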

    Full Convolution For Input Gradient: sum(conv(error, kernel))

    • out_dim = (I-K+1) + (K) - 1
    • out_dim = I - K + K + 1 - 1
    • out_dim = I
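
    And the same kind of check for the input gradient (again just a sketch; convolve2d flips the kernel, i.e. it is a true convolution rather than a correlation):

        import numpy as np
        from scipy.signal import convolve2d

        I, K = 8, 3
        kernel = np.random.randn(K, K)
        error = np.random.randn(I - K + 1, I - K + 1)  # error from a valid forward pass

        # full convolution of the error with the (flipped) kernel
        dX = convolve2d(error, kernel, mode='full')
        print(dX.shape)  # (8, 8) -- same shape as the input; summed over filters in practice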

    Awesome! All the dimensions of the gradients line up to be applied during backpropagation.

    Bonus: For those looking to do this with "same" convolution on the forward pass: make sure you pad the error from the fully connected layer on your backward pass by kernel-1, so the dimensions line up for the kernel updates. Subsequent layers will automatically handle this since they use full convolution with the rotated kernels (I know, I know, I should have used correlation and convolution correctly...).
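
    A shape-only sketch of that bonus point (assuming an odd kernel size so the kernel-1 padding splits evenly between the two sides; this only checks dimensions, not the exact correlation/convolution bookkeeping):

        import numpy as np
        from scipy.signal import correlate2d

        I, K = 8, 3
        x = np.random.randn(I, I)      # layer input
        error = np.random.randn(I, I)  # error is I x I because the forward pass used "same"

        padded_error = np.pad(error, (K - 1) // 2)  # pad by kernel-1 in total
        dK = correlate2d(padded_error, x, mode='valid')
        print(dK.shape)  # (3, 3) -- lines up with the kernel update again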