Tags: python, pytorch, conv-neural-network, convolution

What happens in a convolution when the stride is larger than the kernel?


I was recently experimenting with convolutions and transposed convolutions in PyTorch. I noticed that with the nn.ConvTranspose2d API (I haven't tried the normal convolution API yet), you can specify a stride that is larger than the kernel size and the convolution still works.

What is happening in this case? I'm confused because if the stride is larger than the kernel, that means some pixels in the input image will not be convolved. So what happens to them?

I have the following snippet where I manually set the weights for an nn.ConvTranspose2d layer:

import numpy as np
import torch
import torch.nn as nn

IN = 1
OUT = 1
KERNEL_SIZE = 2
proof_conv = nn.ConvTranspose2d(IN, OUT, kernel_size=KERNEL_SIZE, stride=4)
# ConvTranspose2d stores its weight as (in_channels, out_channels, kH, kW)
assert proof_conv.weight.shape == (IN, OUT, KERNEL_SIZE, KERNEL_SIZE)

FILTER = [
    [1., 2.],
    [0., 1.]
]
weights = [
    [FILTER]
]

weights_as_tensor = torch.from_numpy(np.asarray(weights)).float()
assert weights_as_tensor.shape == proof_conv.weight.shape
proof_conv.weight = nn.Parameter(weights_as_tensor)

# Single-channel 2x2 input image (unbatched: shape (C, H, W))
img = [[
  [1., 2.],
  [3., 4.]
]]
img_as_tensor = torch.from_numpy(np.asarray(img)).float()
out_img = proof_conv(img_as_tensor)
assert out_img.shape == (OUT, 6, 6)

The stride of 4 is larger than the KERNEL_SIZE of 2. Yet the transposed convolution still runs, and we get a 6x6 output. What is happening under the hood?

This post: Understanding the PyTorch implementation of Conv2DTranspose is helpful, but it does not answer the edge case where the stride is greater than the kernel size.


Solution

  • As you already guessed - when the stride is larger than the kernel size, there are input pixels that do not participate in the convolution operation.
    It's up to you, the designer of the architecture, to decide whether this property is a bug or a feature. In some cases I have taken advantage of it to deliberately ignore portions of the input; the sketch below demonstrates the effect with a regular convolution.
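
    A minimal sketch of that effect (my own illustration, not part of the original answer): with a regular nn.Conv2d using kernel_size=2 and stride=4 on a 6x6 input, the kernel windows only cover rows/columns 0-1 and 4-5, so perturbing rows/columns 2-3 leaves the output unchanged.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # kernel_size=2 with stride=4: each 2x2 window is followed by a jump of 4,
    # so rows/columns 2 and 3 of a 6x6 input are never visited by the kernel.
    conv = nn.Conv2d(1, 1, kernel_size=2, stride=4, bias=False)

    x = torch.randn(1, 1, 6, 6)
    x_perturbed = x.clone()
    x_perturbed[..., 2:4, :] += 100.   # perturb rows 2-3
    x_perturbed[..., :, 2:4] += 100.   # perturb columns 2-3

    # The outputs are identical: the perturbed pixels never meet the kernel.
    assert torch.allclose(conv(x), conv(x_perturbed))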

    Update:
    I think you are confused by the bias term in proof_conv. Try eliminating it:

    proof_conv = nn.ConvTranspose2d(IN, OUT, kernel_size=KERNEL_SIZE, stride=4, bias=False)
    

    Now you'll get out_img to be:

    [[[[1., 2., 0., 0., 2., 4.],
       [0., 1., 0., 0., 0., 2.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [3., 6., 0., 0., 4., 8.],
       [0., 3., 0., 0., 0., 4.]]]]
    

    Which represents four copies of the kernel, each weighted by the corresponding input pixel and spaced 4 pixels apart according to stride=4. The rest of the output image is filled with zeros - output pixels that receive no contribution from the transposed convolution; the short sketch below reconstructs the same result by hand.
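
    To make that scatter pattern explicit, here is a small hand-rolled sketch (my own illustration, not part of the original answer) that stamps a scaled copy of the kernel at stride-spaced offsets and reproduces the same 6x6 result:

    import torch

    kernel = torch.tensor([[1., 2.],
                           [0., 1.]])
    img = torch.tensor([[1., 2.],
                        [3., 4.]])
    stride = 4

    # With no padding, the output is (H_in - 1) * stride + kernel_size per dimension.
    out = torch.zeros((img.shape[0] - 1) * stride + 2,
                      (img.shape[1] - 1) * stride + 2)

    # Each input pixel stamps one copy of the kernel, scaled by its value, at an
    # offset of stride * (row, col). Overlapping stamps would be summed, but with
    # stride=4 and a 2x2 kernel the copies never overlap.
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i * stride:i * stride + 2, j * stride:j * stride + 2] += img[i, j] * kernel

    print(out)  # matches the 6x6 output shown above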

    ConvTranspose follows the same "logic" as the regular conv, only in a "transposed" fashion. If you look at the formula for computing the output shape, H_out = (H_in - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1, you'll see that the behavior you get is consistent: here (2 - 1) * 4 + (2 - 1) + 1 = 6.
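
    As a quick check (again my own sketch, relying on the output-shape formula from the ConvTranspose2d documentation), the formula reproduces PyTorch's output size for several strides, including the stride=4 case from the question:

    import torch
    import torch.nn as nn

    def convtranspose_out_size(h_in, kernel_size, stride, padding=0, output_padding=0, dilation=1):
        # Output-size formula from the ConvTranspose2d documentation
        return (h_in - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1

    x = torch.randn(1, 1, 2, 2)
    for stride in (1, 2, 3, 4, 5):
        tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=stride, bias=False)
        assert tconv(x).shape[-1] == convtranspose_out_size(2, kernel_size=2, stride=stride)
    # stride=4 gives (2 - 1) * 4 + (2 - 1) + 1 = 6, matching the 6x6 output above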