I don't understand a passage in the article about the VGGNet. Maybe someone can help.
In my opinion, the number of weights in a convolutional layer is
p=w*h*d*n+n
where w is the filter width, h the filter height, d the filter depth, and n the number of filters.
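To make this concrete, here is a small sketch of how I compute it (just an illustrative function, not taken from the article):

```python
def conv_params(w, h, d, n, bias=True):
    """Parameter count of a conv layer with n filters of size w x h x d."""
    return w * h * d * n + (n if bias else 0)

# Example: 64 filters of size 3x3 applied to a 3-channel (RGB) input
print(conv_params(3, 3, 3, 64))  # 1792 = 3*3*3*64 + 64
```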
In the article the following is written:
assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by 3*(3^2*C^2) = 27C^2 weights; at the same time, a single 7 × 7 conv. layer would require 7^2*C^2 = 49C^2 parameters.
I do not understand, what is meant by channels here, and why this formula is used.
Can someone explain this to me?
Thanks in advance.
Your intuition is correct; we just need to unpack their explanation a bit. For the first case:
w = 3 # filter width
h = 3 # filter height
d = C # filter depth (equal to the number of input channels; e.g. an RGB input has C=3)
n = C # number of output filters/channels
This makes w*h*d*n = 9C^2 parameters per layer. Since three of these layers are stacked, that's 27C^2 in total.
For a single 7x7 layer, the same formula gives 7*7*C*C*1 = 49C^2.
The final difference is the extra n you add at the end in your original post; those are the bias terms, which VGG skips (many people skip bias terms; their value is debatable in some settings).
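To see the paper's 27C^2 vs 49C^2 comparison numerically, here is a minimal sketch (bias terms omitted, as in the paper's comparison; C=256 is just a sample channel width):

```python
def conv_weights(k, c_in, c_out):
    """Weight count of a k x k conv layer (no bias), mapping c_in -> c_out channels."""
    return k * k * c_in * c_out

C = 256  # sample value; any channel width works
stack_3x3 = 3 * conv_weights(3, C, C)   # three stacked 3x3 layers: 27*C^2
single_7x7 = conv_weights(7, C, C)      # one 7x7 layer: 49*C^2
print(stack_3x3, single_7x7)            # 1769472 vs 3211264
```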