Tags: python, machine-learning, deep-learning, pytorch, resnet

How to calculate kernel dimensions from original image dimensions?


https://github.com/kuangliu/pytorch-cifar/blob/master/models/resnet.py

From reading https://www.cs.toronto.edu/~kriz/cifar.html, the CIFAR dataset consists of images that are each 32x32 pixels.

My understanding of the code:

self.conv1 = nn.Conv2d(3, 6, 5)
self.conv2 = nn.Conv2d(6, 16, 5)
self.fc1   = nn.Linear(16*5*5, 120)

is:

self.conv1 = nn.Conv2d(3, 6, 5)      # 3 channels in, 6 channels out, kernel size of 5
self.conv2 = nn.Conv2d(6, 16, 5)     # 6 channels in, 16 channels out, kernel size of 5
self.fc1   = nn.Linear(16*5*5, 120)  # 16*5*5 in features, 120 out features
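
As a quick sanity check on that reading (only a sketch, not part of the linked code), the layers can be constructed directly and their attributes printed, which also confirms the stride and padding defaults:

import torch.nn as nn

conv1 = nn.Conv2d(3, 6, 5)      # in_channels=3, out_channels=6, kernel_size=5
fc1   = nn.Linear(16*5*5, 120)  # in_features=400, out_features=120

print(conv1.in_channels, conv1.out_channels, conv1.kernel_size)  # 3 6 (5, 5)
print(conv1.stride, conv1.padding)                               # (1, 1) (0, 0)
print(fc1.in_features, fc1.out_features)                         # 400 120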

From lenet.py (in the same repository as the resnet.py linked above), the following:

self.fc1   = nn.Linear(16*5*5, 120)

From http://cs231n.github.io/convolutional-networks/, the following is stated:

Summary. To summarize, the Conv Layer:

  • Accepts a volume of size W1×H1×D1.
  • Requires four hyperparameters: the number of filters K, their spatial extent F, the stride S, and the amount of zero padding P.
  • Produces a volume of size W2×H2×D2, where:
    • W2 = (W1 − F + 2P)/S + 1
    • H2 = (H1 − F + 2P)/S + 1 (i.e. width and height are computed equally by symmetry)
    • D2 = K
  • With parameter sharing, it introduces F⋅F⋅D1 weights per filter, for a total of (F⋅F⋅D1)⋅K weights and K biases.
  • In the output volume, the d-th depth slice (of size W2×H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offset by the d-th bias.
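
Plugging the first layer here into that formula gives W2 = (32 − 5 + 2·0)/1 + 1 = 28. A minimal sketch (the conv_output_size helper is only for illustration, not from any of the linked code):

def conv_output_size(w, f, p=0, s=1):
    # W2 = (W1 - F + 2P)/S + 1, for one spatial dimension
    return (w - f + 2*p) // s + 1

print(conv_output_size(32, 5))   # 28: conv1 maps a 32x32 input to 28x28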

From this I'm attempting to understand how the 32x32 training image (1024 pixels per channel) is transformed into the 16*5*5 (= 400) feature map used as the input to nn.Linear(16*5*5, 120).

From https://pytorch.org/docs/stable/nn.html#torch.nn.Conv2d I can see that the default stride is 1 and the default padding is 0.

What are the steps to arrive at 16*5*5 from an image dimension of 32x32, and can 16*5*5 be derived from the steps above?

From the above steps, how do I calculate the spatial extent?

Update:

Source code:

'''LeNet in PyTorch.'''
import torch.nn as nn
import torch.nn.functional as F

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1   = nn.Linear(16*5*5, 120)
        self.fc2   = nn.Linear(120, 84)
        self.fc3   = nn.Linear(84, 10)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.max_pool2d(out, 2)
        out = F.relu(self.conv2(out))
        out = F.max_pool2d(out, 2)
        out = out.view(out.size(0), -1)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out

Taken from https://github.com/kuangliu/pytorch-cifar/blob/master/models/lenet.py

My understanding is that the convolution operation is applied to the image data once per kernel (filter). So if a layer is configured with 6 kernels, 6 convolutions are applied to the data, which generates a 6-channel representation of the image.
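
That can be checked with a dummy input tensor; a minimal sketch (the random tensor just stands in for one CIFAR image):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # one fake CIFAR image: 3 channels, 32x32
out = nn.Conv2d(3, 6, 5)(x)     # 6 kernels, each 5x5 across all 3 input channels
print(out.shape)                # torch.Size([1, 6, 28, 28]) -> 6 output channels

Each of the 6 kernels produces one 28x28 channel, so the number of kernels sets the output depth, while the kernel size (5) only affects the spatial size.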


Solution

  • You do not provide enough information in your question (see my comment).

    However, if I have to guess, then you have two pooling layers (each with stride 2), one after each convolution layer (a shape check is sketched after this list):

    • input size 32x32 (3 channels)
    • conv1 output size 28x28 (6 channels): conv with no padding and kernel size 5, reduces input size by 4.
    • Pooling layer with stride 2, output size 14x14 (6 channels).
    • conv2 output size 10x10 (16 channels): again, kernel size 5 with no padding reduces 14x14 by 4 to 10x10.
    • Another pooling layer with stride 2, output size 5x5 (16 channels)
    • A fully connected layer (nn.Linear) connecting all 5x5x16 = 400 inputs to all 120 outputs.
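
    Those numbers can be reproduced by pushing a dummy tensor through equivalent layers and printing the intermediate shapes; this is only a sketch with freshly constructed layers, not the model object from the question:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    x = torch.randn(1, 3, 32, 32)                        # input: 3x32x32
    x = F.max_pool2d(F.relu(nn.Conv2d(3, 6, 5)(x)), 2)   # conv1 -> 6x28x28, pool -> 6x14x14
    print(x.shape)                                       # torch.Size([1, 6, 14, 14])
    x = F.max_pool2d(F.relu(nn.Conv2d(6, 16, 5)(x)), 2)  # conv2 -> 16x10x10, pool -> 16x5x5
    print(x.shape)                                       # torch.Size([1, 16, 5, 5])
    print(x.view(x.size(0), -1).shape)                   # torch.Size([1, 400]) == 16*5*5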

    A more thorough guide for estimating the receptive field can be found here.