BACKGROUND
I have followed the explanations of backpropagation in a CNN from these two videos.
https://www.youtube.com/watch?v=Pn7RK7tofPg&t=703
https://www.youtube.com/watch?v=Lakz2MoHy6o&t=1455s
As I understand it, the gradient for a single kernel is the input to the current layer convolved with the delta backpropagated from the error of the next layer.
delta_channel = conv(Xm, ERRORn)
Knm' = Knm - learningRate * delta_channel
Where
n is filter index
m is channel/depth index of filter
K is kernel
Xm is the input to the filter at corresponding depth
ERRORn is the error backpropagated from the output of filter n
ISSUE
However, this is confusing for me because the shapes do not align.
Say the variables have the following shapes:
K is 3x3x3x32 (32 filters, each 3x3 with one slice per input channel)
X is 128x128x3
ERROR is 128x128x32 (one 128x128 map per filter in K, since the forward pass is zero padded)
According to the backpropagation equation, that means delta_channel will be 128x128.
A 128x128 delta cannot be subtracted from a 3x3 kernel because the shapes differ. Where is my misunderstanding?
EDIT: The real reason is that one must use "full" and "valid" convolutions during backpropagation. I mistakenly believed "full" was the same as "same".
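There are three types of convolution to keep in mind in this scenario: valid, same, and full. Along one side, their output sizes are I-K+1, I, and I+K-1 respectively, where
I: input size
K: kernel size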
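As a quick check of how the output size depends on the mode, here is a small sketch using SciPy's convolve2d. The sizes 7 and 3 are arbitrary values picked for illustration, not taken from the network above.

```python
import numpy as np
from scipy.signal import convolve2d

I, K = 7, 3  # illustrative input and kernel side lengths (assumed values)
x = np.random.default_rng(0).standard_normal((I, I))
k = np.ones((K, K))

# Output shape for each of the three convolution modes.
shapes = {mode: convolve2d(x, k, mode=mode).shape
          for mode in ("valid", "same", "full")}

for mode, shape in shapes.items():
    print(mode, shape)
```

Along one side this gives I-K+1 for valid, I for same, and I+K-1 for full.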
The important part is that a full convolution is used when backpropagating to the input and a valid convolution is used for the kernel gradients, because this lines the error up with the parts of the input and filters that contributed to it.
During the forward pass, either same or valid convolutions can be used.
Let's look at example dimensions to see why. For brevity, we only look at one side of a square filter.
For these examples we assume a valid convolution is used during the forward pass.
This means the output error has size I-K+1.
Valid Convolution For Kernels: corr(input, error, valid) has size I - (I-K+1) + 1 = K
Full Convolution For Input Gradient: sum(conv(error, kernel, full)) has size (I-K+1) + K - 1 = I
Awesome! All the dimensions of the gradients line up, so they can be applied during backpropagation.
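To make this concrete, here is a minimal single-channel sketch, assuming the forward pass is a valid cross-correlation (the usual CNN convention) and using SciPy's correlate2d and convolve2d. The sizes, the random data, and the helper names forward, loss and numeric_grad are all made up for illustration. The loss is chosen so that its gradient with respect to the output is exactly err, which lets a finite-difference check confirm both gradients.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
I, K = 8, 3                                        # illustrative sizes (assumed)
x = rng.standard_normal((I, I))                    # input to the layer
k = rng.standard_normal((K, K))                    # kernel
err = rng.standard_normal((I - K + 1, I - K + 1))  # dL/d(output)

def forward(x, k):
    # Forward pass: valid cross-correlation, output side I-K+1.
    return correlate2d(x, k, mode="valid")

def loss(x, k):
    # Toy loss whose gradient w.r.t. the output is exactly err.
    return np.sum(forward(x, k) * err)

# Kernel gradient: valid correlation of the input with the error.
grad_k = correlate2d(x, err, mode="valid")         # shape (K, K)
# Input gradient: full convolution of the error with the kernel.
grad_x = convolve2d(err, k, mode="full")           # shape (I, I)

def numeric_grad(f, w, eps=1e-6):
    # Forward finite differences, one entry at a time.
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        w_step = w.copy()
        w_step[idx] += eps
        g[idx] = (f(w_step) - f(w)) / eps
    return g

print(grad_k.shape, grad_x.shape)
print(np.allclose(grad_k, numeric_grad(lambda kk: loss(x, kk), k), atol=1e-4))
print(np.allclose(grad_x, numeric_grad(lambda xx: loss(xx, k), x), atol=1e-4))
```

With several filters, the input gradient would be the sum of this full convolution over all filters that read the same input channel.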
Bonus: For those wondering how to do this with a "same" convolution on the forward pass: make sure you pad the error from the fully connected layer by kernel-1 on your backward pass, so the dimensions line up for the kernel updates. Subsequent layers will handle this automatically, since they use full convolution with the rotated kernels (I know, I know, I should have used correlation and convolution correctly...).
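For the shapes in the original question (a 128x128 input channel, a 128x128 error map, a 3x3 kernel), the padding trick can be sketched as below. This assumes an odd kernel size, zero padding split evenly on both sides (kernel-1 in total), and SciPy's correlate2d. The random arrays stand in for Xm and ERRORn. Padding the input is shown next to padding the error. Under SciPy's convention the second variant comes out rotated by 180 degrees, so it is rotated back before comparing.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
I, K = 128, 3
pad = (K - 1) // 2                  # per-side padding, K-1 in total (odd K assumed)
x = rng.standard_normal((I, I))     # stand-in for one input channel Xm
err = rng.standard_normal((I, I))   # stand-in for ERRORn after a "same" forward pass

# Option A: zero-pad the input, then valid-correlate with the error.
grad_k_a = correlate2d(np.pad(x, pad), err, mode="valid")

# Option B: zero-pad the error instead and correlate it with the input.
# The result is the same gradient rotated by 180 degrees, so rotate it back.
grad_k_b = correlate2d(np.pad(err, pad), x, mode="valid")[::-1, ::-1]

print(grad_k_a.shape, grad_k_b.shape)
print(np.allclose(grad_k_a, grad_k_b))
```

Either way the result is 3x3, so it can be subtracted from Knm directly. For the full layer this would be repeated for every filter n and channel m.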