I am playing with CNNs these days, and I have the code pasted below. My question is: would this work on any image size? It is not clear to me which parameter or channel, if any, cares about the image size. And if none of them does, how does the model know how many neurons it needs; isn't that a function of the image size?
Related point on pretrained models: if I use a pretrained model, do I need to reformat my images to match what the model was originally trained on, or how does that work?
import torch
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, num_classes, num_channels=1):
        super(CNN, self).__init__()
        self.num_classes = num_classes
        self.conv1 = nn.Conv2d(num_channels, 32, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        # 64*7*7 assumes a 28x28 input: 28 -> 14 -> 7 after two 2x2 poolings.
        self.fc = nn.Linear(64 * 7 * 7, num_classes)

    def forward(self, x):  # forward pass (assumed): conv -> relu -> pool twice, flatten, classify
        x = self.pool1(self.relu1(self.conv1(x)))
        x = self.pool2(self.relu2(self.conv2(x)))
        x = x.view(x.size(0), -1)
        return self.fc(x)
No, but the Conv2d layers can work on variable image sizes. A Conv2d layer slides its kernel over every kernel_size x kernel_size patch of the input, and padding enlarges the input before the convolution is applied. So as long as the image is at least kernel_size x kernel_size after padding, the convolution works. Here kernel_size is 3 and padding is 1, so even a single pixel would work, because after padding the input is 3x3. The reason this particular CNN can't take variable image sizes is the linear layer: nn.Linear(64*7*7, num_classes) expects a 7x7 feature map after the two max-pooling steps, which means the input image must be 28x28.
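Here is a quick sketch of that claim (it assumes the forward pass shown in the class above): the convolution and pooling layers happily process a 32x32 input, but the flattened features then have 64*8*8 = 4096 values instead of the 64*7*7 = 3136 the linear layer was built for, so the full model only runs on 28x28 images.

import torch

model = CNN(num_classes=10, num_channels=1)

# 28x28 works: 28 -> 14 -> 7 after the two 2x2 poolings, so the flatten yields 64*7*7.
out = model(torch.rand(1, 1, 28, 28))
print(out.shape)  # torch.Size([1, 10])

# The conv/pool layers themselves are size-agnostic: a 32x32 input goes through fine...
x = torch.rand(1, 1, 32, 32)
feats = model.pool2(model.relu2(model.conv2(model.pool1(model.relu1(model.conv1(x))))))
print(feats.shape)  # torch.Size([1, 64, 8, 8])

# ...but the full model fails, because nn.Linear(64*7*7, 10) can't take 64*8*8 features.
# model(x)  # RuntimeError: mat1 and mat2 shapes cannot be multiplied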
It depends on the type of layer. A linear layer has a fixed input size and output size, and those determine its number of parameters (neurons). A convolution layer's parameter count is determined by its kernel size and its number of input and output channels, so it doesn't depend on the image size.
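To make that concrete, here is a small check using the layer sizes from the code above: the conv layer's parameter count is independent of image size, while the linear layer's parameter count bakes in the 28x28 assumption.

import torch.nn as nn

# Conv2d(1, 32, kernel_size=3): 32*1*3*3 weights + 32 biases = 320 parameters,
# no matter how large the input image is.
conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))  # 320

# Linear(64*7*7, 10): 3136*10 weights + 10 biases = 31370 parameters,
# and that 3136 is exactly where the 28x28 input size gets baked in.
fc = nn.Linear(64 * 7 * 7, 10)
print(sum(p.numel() for p in fc.parameters()))  # 31370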
It depends on the model; some models (for example, fully convolutional architectures or ones that end in adaptive pooling) accept variable image sizes even though they were trained at a specific resolution. In general, though, you should preprocess your images the same way the pretrained model's training data was preprocessed (resize, crop, normalization), because that is what the weights were fit to.
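As a sketch of the usual workflow with torchvision (assuming torchvision >= 0.13, with ResNet-18 chosen purely as an example): the pretrained weights come bundled with the preprocessing pipeline used at training time, and you apply it to your own images before inference.

import torch
from torchvision import models

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.eval()

# The weights object carries the resize/crop/normalization used during training,
# so inputs are formatted the way the model expects.
preprocess = weights.transforms()

img = torch.rand(3, 500, 375)         # stand-in for a loaded RGB image tensor
batch = preprocess(img).unsqueeze(0)  # resized, center-cropped to 224x224, normalized
with torch.no_grad():
    logits = model(batch)
print(logits.shape)  # torch.Size([1, 1000])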