Tags: deep-learning, pytorch, conv-neural-network

How to compute the parameters in the CNN classifier?


I have implemented a CNN model for MNIST. I was able to understand how to compute the parameters and shapes for the different layers of the CNN, but I want to understand how to determine the in_features and out_features in the classifier part, specifically in nn.Linear(). Also, how do I select in_channels and out_channels in nn.Conv2d?

import torch
import torch.nn as nn

class CNNclf(nn.Module):
    def __init__(self):
        super().__init__()
        # feature extractor: convolution -> ReLU -> max pooling, three times
        self.net = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3),
            nn.ReLU(),
            nn.MaxPool2d((2, 2), stride=2),
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3),
            nn.ReLU(),
            nn.MaxPool2d((2, 2), stride=3),
            nn.Conv2d(in_channels=128, out_channels=64, kernel_size=3),
            nn.ReLU(),
            nn.MaxPool2d((2, 2), stride=2))
        # classifier: flatten, then two fully connected layers
        self.clf = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64, 20, bias=True),
            nn.ReLU(),
            nn.Linear(20, 10, bias=True))

    def forward(self, x):
        x = self.net(x)
        x = self.clf(x)
        return x

Solution

  • This is a typical question when it comes to CNNs; there are many similar posts where users encounter the same type of error, which originates from this issue:

    RuntimeError: mat1 and mat2 shapes cannot be multiplied (ixj and kxl)

    I will provide a canonical answer here.

    When working with nn.Conv2d, intermediate tensors are four-dimensional: (b, c, h, w). Convolutions operate spatially: they slide across the height and width dimensions (for 2D convolutions). The number of output channels is determined by the number of filters in the convolution layer, each of which operates independently of the others. You can read more about convolution layers and sizes in: Understanding convolutional layers shapes.
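
    As a quick illustration, here is a minimal sketch (the shapes in the comments assume a 28x28 MNIST-sized input; note how each layer's in_channels must match the out_channels of the layer before it):

    import torch
    import torch.nn as nn

    x = torch.rand(1, 1, 28, 28)  # (b=1, c=1, h=28, w=28)
    conv1 = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3)
    y = conv1(x)
    print(y.shape)                # torch.Size([1, 64, 26, 26]): 64 filters; 3x3 kernel shrinks h and w by 2

    conv2 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
    print(conv2(y).shape)         # torch.Size([1, 128, 24, 24])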

    When it comes to CNN architectures, you have to accommodate a change in dimensionality when moving from the convolutional part (the feature extractor) to the fully connected layers (the classifier). In the general case, the tensor goes from 4D to 2D. This requires some form of spatial reduction, either:

    • via flattening, leading to a shape of (b, c*h*w). This can be done with nn.Flatten;
    • or via a pooling operation such as max pooling (nn.MaxPool2d) or adaptive average pooling (nn.AdaptiveAvgPool2d), resulting in a reduced shape of (b, c, h', w'). If h' and w' are not singletons, a flattening operation is still necessary, as shown in the sketch after this list.
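
    Here is a small sketch contrasting the two reductions (a minimal example; the (3, 64, 7, 7) input matches the dummy inference shown further below):

    import torch
    import torch.nn as nn

    feat = torch.rand(3, 64, 7, 7)               # (b, c, h, w), as output by a feature extractor

    flat = nn.Flatten()(feat)
    print(flat.shape)                            # torch.Size([3, 3136]) -> (b, c*h*w)

    pooled = nn.AdaptiveAvgPool2d((1, 1))(feat)  # collapse h and w to singletons
    print(pooled.shape)                          # torch.Size([3, 64, 1, 1])
    print(nn.Flatten()(pooled).shape)            # torch.Size([3, 64]) -> (b, c)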

    Ultimately, the output shape of the last convolution layer depends on two things: the input shape, and the number and sizes of the convolution and pooling layers preceding it. The above error refers to a shape mismatch between the output of the CNN and the shape expected by the first linear layer: i is the batch size, j is the actual flattened feature length, k is the in_features of the first linear layer, and l is its out_features. So when you get this error, you already know which in_features to use: j!
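
    For instance, running the model above on a 100x100 input reproduces the error (the exact message wording may vary between PyTorch versions):

    model = CNNclf()
    model(torch.rand(3, 1, 100, 100))
    # RuntimeError: mat1 and mat2 shapes cannot be multiplied (3x3136 and 64x20)
    # i=3 (batch), j=3136 (flattened features), k=64 (in_features), l=20 (out_features)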

    To anticipate this error rather than trigger it while debugging the architecture, another way to determine in_features is to truncate the model (removing all linear layers) and perform inference with dummy data. The output shape of that inference tells you which in_features to adopt.

    >>> CNNclf().net(torch.rand(3,1,100,100)).shape # adapt with your input shape
    torch.Size([3, 64, 7, 7])
    

    Therefore the spatial dimensions are 7x7 and the channel count is 64, so the flattened feature dimension is 64*7*7 = 3136. In this case, the first linear layer must be initialized as nn.Linear(3136, 20, bias=True).
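
    You can also compute that number programmatically from the dummy inference output (a small convenience sketch, assuming Python 3.8+ for math.prod):

    import math

    out = CNNclf().net(torch.rand(3, 1, 100, 100))
    in_features = math.prod(out.shape[1:])  # 64 * 7 * 7 = 3136, skipping the batch dimension
    clf = nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_features, 20, bias=True),
        nn.ReLU(),
        nn.Linear(20, 10, bias=True))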

    Since PyTorch version 1.8, there has been a class nn.LazyLinear which infers in_features automatically at runtime (during the first inference of the model). In that case, there is no need to perform the dummy inference yourself; simply use nn.LazyLinear(20, bias=True).
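
    For example, the classifier above could be rewritten as follows (a sketch; the lazy layer materializes its weights on the first forward pass):

    self.clf = nn.Sequential(
        nn.Flatten(),
        nn.LazyLinear(20, bias=True),  # in_features inferred at first inference
        nn.ReLU(),
        nn.Linear(20, 10, bias=True))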