# First convolutional layer: input channels = 1, output channels = 32, kernel size = 5x5, padding = 2 (SAME)
self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=5, stride=1, padding=2)
# First pooling layer: max pooling, kernel size = 2x2, stride = 2
self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
# Second convolutional layer: input channels = 32, output channels = 64, kernel size = 5x5, padding = 2 (SAME)
self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=5, stride=1, padding=2)
# Second pooling layer: max pooling, kernel size = 2x2, stride = 2
self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
Why isn't the output after the second convolution layer 14 * 14 * 32 * 64? For the 32-channel input, doesn't each convolutional kernel operate on one channel, giving 64 different outcomes per channel? Shouldn't the 32 channels multiply the output count?
I got answers like: for every position of the 14 * 14 input, a 5x5x32 kernel dot product with a 5x5x32 input patch gives you a single value, producing a 14 * 14 single-channel output. But isn't the kernel size 5 * 5?
Let's assume that your input shape is (N, 1, 32, 32), in (N, C, H, W) format. The output shape after each layer is then:

- `self.conv1`: (N, 32, 32, 32). 32 filters of shape 5x5x1 are convolved over the padded input (5x5 is the kernel size, 1 is the number of input channels). Hence, the output has 32 feature maps of shape 32x32.
- `self.pool1`: (N, 32, 16, 16). The pooling layer downsamples each feature map by a factor of 2.
- `self.conv2`: (N, 64, 16, 16). 64 filters of shape 5x5x32 are applied to the padded input (5x5 is the kernel size, 32 is the number of input channels). Each filter spans all 32 input channels and produces a single 16x16 feature map, so the output has 64 feature maps of shape 16x16, not 32 * 64 of them.
- `self.pool2`: (N, 64, 8, 8). The pooling layer downsamples each feature map by a factor of 2.