neural-network deep-learning caffe conv-neural-network convolution

How to calculate the third dimension of a Caffe convnet layer's output?

Following this question and this tutorial, I've created a simple net just like the one in the tutorial, but with 100x100 images, a first convolution kernel of 11x11, and pad=0.

I understand that the formula is (W−F+2P)/S+1, and in my case the dimensions became [51x51x3] (3 being the RGB channels). But the number 96 pops up in my net diagram, and as this tutorial says, it is the third dimension of the output. In other words, after the first conv layer my net's output became [51x51x96]. I couldn't figure out how the number 96 is calculated, or why.

Isn't the convolution layer supposed to pass through the three color channels, so that the output is three feature maps? How come the dimension grows like this? Isn't it true that we have one kernel for each channel? How does this one kernel create 96 (or, in the first tutorial, 256 or 384) feature maps?

Solution

You are mixing up input channels and output channels.
Your input image has three channels: R, G and B. Each filter in your conv layer spans all three of these channels as well as its spatial kernel size (e.g., 3-by-3), and it outputs a single number per spatial location. So, if you have only one filter in your layer, your output has only one channel(!)
Normally, you want to compute more than a single filter at each layer. This is what the num_output parameter in convolution_param is for: it defines how many filters are trained in a specific convolutional layer. The 96 in your diagram is the num_output of your first conv layer; it is set in the layer definition, not calculated from the input.
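
To make the "one filter, one output channel" point concrete, here is a minimal pure-Python sketch of a naive convolution. It is an illustration only, not how Caffe implements the layer: the helper names (`conv2d`, `random_tensor`, `shape_of`) are made up for this example, the kernel is assumed square, pad is assumed 0, and the bias term is left out.

```python
import random


def conv2d(image, filters, stride=1):
    """Naive convolution with pad=0 (illustrative, not Caffe's implementation).

    image:   nested list indexed [channel][row][col]
    filters: nested list indexed [filter][channel][row][col]
    Returns a nested list indexed [filter][row][col].
    """
    in_channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    k = len(filters[0][0])  # square kernel assumed
    out_h = (height - k) // stride + 1
    out_w = (width - k) // stride + 1

    output = []
    for f in filters:  # each filter produces exactly one output channel
        channel = []
        for i in range(out_h):
            row = []
            for j in range(out_w):
                total = 0.0
                # Sum over ALL input channels and the k-by-k window,
                # collapsing them into a single number per location.
                for c in range(in_channels):
                    for di in range(k):
                        for dj in range(k):
                            total += f[c][di][dj] * image[c][i * stride + di][j * stride + dj]
                row.append(total)
            channel.append(row)
        output.append(channel)
    return output


def random_tensor(*shape):
    """Build a nested list of the given shape filled with random floats."""
    if len(shape) == 1:
        return [random.random() for _ in range(shape[0])]
    return [random_tensor(*shape[1:]) for _ in range(shape[0])]


def shape_of(tensor):
    """Return the shape of a regular nested list as a tuple."""
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        tensor = tensor[0]
    return tuple(shape)


image = random_tensor(3, 8, 8)           # RGB input, 8-by-8
one_filter = random_tensor(1, 3, 3, 3)   # 1 filter spanning 3 channels, 3-by-3
many_filters = random_tensor(5, 3, 3, 3) # 5 filters, same per-filter shape

print(shape_of(conv2d(image, one_filter)))    # (1, 6, 6)
print(shape_of(conv2d(image, many_filters)))  # (5, 6, 6)
```

The number of input channels never shows up in the output shape. It only determines how deep each filter is. The first output dimension is simply the number of filters.
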
Thus, a conv layer such as

layer {
  type: "Convolution"
  name: "my_conv"
  bottom: "x"  # shape 3-by-100-by-100
  top: "y"
  convolution_param {
    num_output: 32  # number of filters = number of output channels
    kernel_size: 3
  }
}

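will output "y" with shape 32-by-98-by-98: 32 channels, one per filter, and 98 = (100 − 3)/1 + 1 in each spatial dimension, since stride and pad default to 1 and 0 when they are not specified.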
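
The shape arithmetic can be checked with a short Python sketch. The function names are hypothetical helpers, not part of Caffe. The sketch assumes a square kernel, the same stride and pad in both spatial dimensions, and rounding down when the division is not exact.

```python
def conv_output_shape(input_shape, num_output, kernel_size, stride=1, pad=0):
    """Output shape (channels, height, width) of a conv layer.

    Applies (W - F + 2P) / S + 1 to each spatial dimension. Rounding down
    on inexact division is an assumption of this sketch.
    """
    _in_channels, height, width = input_shape  # input channels do not affect the output shape
    out_h = (height - kernel_size + 2 * pad) // stride + 1
    out_w = (width - kernel_size + 2 * pad) // stride + 1
    return (num_output, out_h, out_w)


def conv_weight_count(in_channels, num_output, kernel_size):
    """Number of filter weights (biases excluded): one k-by-k slice per
    input channel, for each of the num_output filters."""
    return num_output * in_channels * kernel_size * kernel_size


# The layer from the answer: 3-by-100-by-100 input, 32 filters of size 3-by-3.
print(conv_output_shape((3, 100, 100), num_output=32, kernel_size=3))  # (32, 98, 98)

# The first layer from the question: 96 filters of size 11-by-11 over 3 channels.
print(conv_weight_count(in_channels=3, num_output=96, kernel_size=11))  # 34848
```

The three input channels affect only the number of weights per filter (3 x 11 x 11 = 363 in the question's first layer), never the number of output channels.
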