I don't understand a passage in the article about the VGGNet. Maybe someone can help.
In my opinion, the number of weights in a convolutional layer is
p=w*h*d*n+n
where w is the filter width, h the filter height, d the filter depth, and n the number of filters.
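To make this concrete, here is a small sketch of how I compute it (just an illustrative function, not taken from the article):

```python
def conv_params(w, h, d, n, bias=True):
    """Parameter count of a conv layer with n filters of size w x h x d."""
    return w * h * d * n + (n if bias else 0)

# Example: 64 filters of size 3x3 applied to a 3-channel (RGB) input
print(conv_params(3, 3, 3, 64))  # 1792 = 3*3*3*64 + 64
```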
In the article the following is written:
assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by 3*(3^2*C^2) = 27C^2 weights; at the same time, a single 7 × 7 conv. layer would require 7^2*C^2 = 49C^2 parameters.
I do not understand, what is meant by channels here, and why this formula is used.
Can someone explain this to me?
Thanks in advance.
Your intuition is correct; we just need to unpack their explanation a bit. For the first case:
w = 3 # filter width
h = 3 # filter height
d = C # filter depth (equal to the number of input channels; e.g. an RGB input has C=3)
n = C # number of output filters/channels
This makes w*h*d*n = 9C^2 parameters per layer. Since three of these layers are stacked, that's 27C^2 in total.
For a single 7x7 layer, the same formula gives 7*7*C*C*1 = 49C^2.
The final difference is the extra n you add at the end in your original post; those are the bias terms, which VGG skips (many people skip bias terms; their value is debatable in some settings).
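To see the paper's 27C^2 vs 49C^2 comparison numerically, here is a minimal sketch (bias terms omitted, as in the paper's comparison; C=256 is just a sample channel width):

```python
def conv_weights(k, c_in, c_out):
    """Weight count of a k x k conv layer (no bias), mapping c_in -> c_out channels."""
    return k * k * c_in * c_out

C = 256  # sample value; any channel width works
stack_3x3 = 3 * conv_weights(3, C, C)   # three stacked 3x3 layers: 27*C^2
single_7x7 = conv_weights(7, C, C)      # one 7x7 layer: 49*C^2
print(stack_3x3, single_7x7)            # 1769472 vs 3211264
```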