I was trying to learn PyTorch and came across a tutorial where a CNN is defined like below,
from torch.nn import Module, Sequential, Conv2d, BatchNorm2d, ReLU, MaxPool2d, Linear

class Net(Module):
    def __init__(self):
        super(Net, self).__init__()

        self.cnn_layers = Sequential(
            # Defining a 2D convolution layer
            Conv2d(1, 4, kernel_size=3, stride=1, padding=1),
            BatchNorm2d(4),
            ReLU(inplace=True),
            MaxPool2d(kernel_size=2, stride=2),
            # Defining another 2D convolution layer
            Conv2d(4, 4, kernel_size=3, stride=1, padding=1),
            BatchNorm2d(4),
            ReLU(inplace=True),
            MaxPool2d(kernel_size=2, stride=2),
        )

        self.linear_layers = Sequential(
            Linear(4 * 7 * 7, 10)
        )

    # Defining the forward pass
    def forward(self, x):
        x = self.cnn_layers(x)
        x = x.view(x.size(0), -1)
        x = self.linear_layers(x)
        return x
I understand how the cnn_layers are built: after the cnn_layers, the data should be flattened and passed to the linear_layers.
What I don't understand is how the number of input features to Linear is 4 * 7 * 7. I get that 4 is the number of output channels from the last Conv2d layer, but how does 7 * 7 come into the picture? Do the stride or padding play any role in that?
Input image shape is [1, 28, 28]
The Conv2d layers have a kernel size of 3 with stride and padding of 1, which means they don't change the spatial size of the image: the output size is (28 + 2*1 - 3)/1 + 1 = 28. The two MaxPool2d layers each reduce the spatial dimensions from (H, W) to (H/2, W/2). So, for each batch, the output of the last convolutional block, which has 4 output channels, has the shape (batch_size, 4, H/4, W/4).
. In the forward pass feature tensor is flattened by x = x.view(x.size(0), -1)
which makes it in the shape (batch_size, H*W/4)
. I assume H and W are 28, for which the linear layer would take inputs of shape (batch_size, 196)
.
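Continuing the same hypothetical shape check (again just a sketch, assuming the Net class above is defined), flattening the feature map exactly as forward() does shows where 4 * 7 * 7 = 196 comes from:

import torch

net = Net()
# Output of the convolutional part for a dummy batch, shape (8, 4, 7, 7)
features = net.cnn_layers(torch.randn(8, 1, 28, 28))
flat = features.view(features.size(0), -1)
print(flat.shape)                     # torch.Size([8, 196]), i.e. (batch_size, 4 * 7 * 7)
print(net.linear_layers(flat).shape)  # torch.Size([8, 10])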