Tags: python, machine-learning, deep-learning, pytorch, resnet

How to calculate kernel dimensions from original image dimensions?


https://github.com/kuangliu/pytorch-cifar/blob/master/models/resnet.py

From reading https://www.cs.toronto.edu/~kriz/cifar.html, the CIFAR dataset consists of images that are each 32x32 pixels.

My understanding of the code:

self.conv1 = nn.Conv2d(3, 6, 5)
self.conv2 = nn.Conv2d(6, 16, 5)
self.fc1   = nn.Linear(16*5*5, 120)

is:

self.conv1 = nn.Conv2d(3, 6, 5)      # 3 channels in, 6 channels out, kernel size of 5
self.conv2 = nn.Conv2d(6, 16, 5)     # 6 channels in, 16 channels out, kernel size of 5
self.fc1   = nn.Linear(16*5*5, 120)  # 16*5*5 in features, 120 out features
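
As a quick sanity check on that reading (only a sketch, not part of the linked code), the layers can be constructed directly and their attributes printed, which also confirms the stride and padding defaults:

import torch.nn as nn

conv1 = nn.Conv2d(3, 6, 5)      # in_channels=3, out_channels=6, kernel_size=5
fc1   = nn.Linear(16*5*5, 120)  # in_features=400, out_features=120

print(conv1.in_channels, conv1.out_channels, conv1.kernel_size)  # 3 6 (5, 5)
print(conv1.stride, conv1.padding)                               # (1, 1) (0, 0)
print(fc1.in_features, fc1.out_features)                         # 400 120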

From lenet.py (in the same repository as the resnet.py linked above), the following:

self.fc1   = nn.Linear(16*5*5, 120)

From http://cs231n.github.io/convolutional-networks/, the following is stated:

Summary. To summarize, the Conv Layer:

  • Accepts a volume of size W1×H1×D1.
  • Requires four hyperparameters: the number of filters K, their spatial extent F, the stride S, and the amount of zero padding P.
  • Produces a volume of size W2×H2×D2, where:
    • W2 = (W1 − F + 2P)/S + 1
    • H2 = (H1 − F + 2P)/S + 1 (i.e. width and height are computed equally by symmetry)
    • D2 = K
  • With parameter sharing, it introduces F⋅F⋅D1 weights per filter, for a total of (F⋅F⋅D1)⋅K weights and K biases.
  • In the output volume, the d-th depth slice (of size W2×H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offset by the d-th bias.
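
Plugging the first layer here into that formula gives W2 = (32 − 5 + 2·0)/1 + 1 = 28. A minimal sketch (the conv_output_size helper is only for illustration, not from any of the linked code):

def conv_output_size(w, f, p=0, s=1):
    # W2 = (W1 - F + 2P)/S + 1, for one spatial dimension
    return (w - f + 2*p) // s + 1

print(conv_output_size(32, 5))   # 28: conv1 maps a 32x32 input to 28x28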

From this I'm attempting to understand how the 32x32 training image (1024 pixels per channel) is transformed into the 16*5*5 (= 400) feature map used as the input to nn.Linear(16*5*5, 120).

From https://pytorch.org/docs/stable/nn.html#torch.nn.Conv2d I can see that the default stride is 1 and the default padding is 0.

What are the steps to arrive at 16*5*5 from an image dimension of 32x32, and can 16*5*5 be derived from the steps above?

From the above steps, how do I calculate the spatial extent?

Update:

Source code:

'''LeNet in PyTorch.'''
import torch.nn as nn
import torch.nn.functional as F

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1   = nn.Linear(16*5*5, 120)
        self.fc2   = nn.Linear(120, 84)
        self.fc3   = nn.Linear(84, 10)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.max_pool2d(out, 2)
        out = F.relu(self.conv2(out))
        out = F.max_pool2d(out, 2)
        out = out.view(out.size(0), -1)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out

Taken from https://github.com/kuangliu/pytorch-cifar/blob/master/models/lenet.py

My understanding is that the convolution operation is applied to the image data once per kernel (filter). So if a layer is configured with 6 kernels, 6 convolutions are applied to the data, which generates a 6-channel representation of the image.
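
That can be checked with a dummy input tensor; a minimal sketch (the random tensor just stands in for one CIFAR image):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # one fake CIFAR image: 3 channels, 32x32
out = nn.Conv2d(3, 6, 5)(x)     # 6 kernels, each 5x5 across all 3 input channels
print(out.shape)                # torch.Size([1, 6, 28, 28]) -> 6 output channels

Each of the 6 kernels produces one 28x28 channel, so the number of kernels sets the output depth, while the kernel size (5) only affects the spatial size.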


Solution

  • You do not provide enough information in your question (see my comment).

    However, if I have to guess, then you have two pooling layers (each with stride 2), one after each convolution layer (a shape check is sketched after this list):

    • input size 32x32 (3 channels)
    • conv1 output size 28x28 (6 channels): conv with no padding and kernel size 5, reduces input size by 4.
    • Pooling layer with stride 2, output size 14x14 (6 channels).
    • conv2 output size 10x10 (16 channels): again, kernel size 5 with no padding reduces 14x14 by 4 to 10x10.
    • Another pooling layer with stride 2, output size 5x5 (16 channels)
    • A fully connected layer (nn.Linear) connecting all 5x5x16 = 400 inputs to all 120 outputs.
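
    Those numbers can be reproduced by pushing a dummy tensor through equivalent layers and printing the intermediate shapes; this is only a sketch with freshly constructed layers, not the model object from the question:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    x = torch.randn(1, 3, 32, 32)                        # input: 3x32x32
    x = F.max_pool2d(F.relu(nn.Conv2d(3, 6, 5)(x)), 2)   # conv1 -> 6x28x28, pool -> 6x14x14
    print(x.shape)                                       # torch.Size([1, 6, 14, 14])
    x = F.max_pool2d(F.relu(nn.Conv2d(6, 16, 5)(x)), 2)  # conv2 -> 16x10x10, pool -> 16x5x5
    print(x.shape)                                       # torch.Size([1, 16, 5, 5])
    print(x.view(x.size(0), -1).shape)                   # torch.Size([1, 400]) == 16*5*5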

    A more thorough guide for estimating the receptive field can be found here.