I am looking at image embeddings and wondering why flipping images changes the output. Consider resnet18 with the head removed for example:
import torch
import torch.nn as nn
import torchvision.models as models
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = models.resnet18(pretrained=True)
model.fc = nn.Identity()
model = model.to(device)
model.eval()
x = torch.randn(20, 3, 128, 128).to(device)
with torch.no_grad():
    y1 = model(x)
    y2 = model(x.flip(-1))
    y3 = model(x.flip(-2))
Most importantly, the tail of the model ends in an AdaptiveAvgPool2d layer, where each channel's feature map is pooled down to a single pixel.
As I understand it, since the network is just convolutions stacked on convolutions before the pooling, flipping the image should simply flip the feature map in the same way. The average pooling then averages the last feature map (per channel) and is invariant to its orientation. AdaptiveMaxPool2d should behave the same way.
The key difference from 'normal' convnets is that we are pooling/averaging down to a single pixel.
However, when I look at y1-y2, y1-y3, and y2-y3, the values are significantly different from zero. What am I getting wrong?
I think the pooling output changes because the inputs to the pooling layer are not what we expect them to be.
Short Answer: The input is flipped, but the weights of the Conv2d layers are not. The kernel weights need to be flipped in the same way as the input to get the expected output.
Long Answer: Here, as per the tail of the model, the output of Conv2d is passed to AdaptiveAvgPool2d. Let's ignore BatchNorm for now for the sake of understanding.
For simplicity, let's consider an input tensor x = [1, 3, 5, 4, 7] and a kernel k = [0.3, 0.5, 0.8]. When the kernel slides over the input with stride=1, the output at position [0,0] is 0.3*1 + 0.5*3 + 0.8*5 = 5.8 and at [0,2] it is 0.3*5 + 0.5*4 + 0.8*7 = 9.1.
Now if the input is flipped, x_flip = [7, 4, 5, 3, 1], the output at position [0,0] is 0.3*7 + 0.5*4 + 0.8*5 = 8.1 and at [0,2] it is 0.3*5 + 0.5*3 + 0.8*1 = 3.8.
If the feature map had merely been flipped, the head of one output would equal the tail of the other. It does not (8.1 != 9.1 and 5.8 != 3.8), so the output of the convolution layer is genuinely different, which gives a different final result after pooling.
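The toy example can be reproduced in a few lines of plain Python. This is only a sketch: correlate_valid and mean are hypothetical helper names, and the sliding window uses stride 1 with no padding, as in the example above.

```python
def correlate_valid(signal, kernel):
    """Slide the kernel over the signal (stride 1, no padding), as a conv layer does."""
    n_out = len(signal) - len(kernel) + 1
    return [
        sum(w * signal[i + j] for j, w in enumerate(kernel))
        for i in range(n_out)
    ]


def mean(values):
    """Global average pooling over a 1D feature map."""
    return sum(values) / len(values)


x = [1, 3, 5, 4, 7]
k = [0.3, 0.5, 0.8]

out = correlate_valid(x, k)             # same kernel, original input
out_flip = correlate_valid(x[::-1], k)  # same kernel, flipped input

print([round(v, 2) for v in out])       # [5.8, 6.6, 9.1]
print([round(v, 2) for v in out_flip])  # [8.1, 6.1, 3.8]
print(round(mean(out), 4), round(mean(out_flip), 4))  # 7.1667 6.0
```

The second output is not the first one reversed, so the averages differ as well.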
So, to get the desired output here, you need to flip the kernel as well.
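The same effect can be checked on a single 2D convolution in PyTorch. This is a minimal sketch, assuming a random 3x3 kernel, stride 1 and symmetric padding of 1. It is not the ResNet itself, where strides and BatchNorm add further complications, and all variable names are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)  # one single-channel image
w = torch.randn(1, 1, 3, 3)  # one random (hence asymmetric) 3x3 kernel

out = F.conv2d(x, w, padding=1)
out_flip_input = F.conv2d(x.flip(-1), w, padding=1)          # flip the input only
out_flip_both = F.conv2d(x.flip(-1), w.flip(-1), padding=1)  # flip input and kernel

# Flipping only the input does not simply flip the feature map...
same_without_kernel_flip = torch.allclose(out_flip_input, out.flip(-1), atol=1e-5)
# ...but flipping the kernel along the same axis does.
same_with_kernel_flip = torch.allclose(out_flip_both, out.flip(-1), atol=1e-5)
# Global average pooling then agrees as well.
pooled_equal = torch.allclose(
    out_flip_both.mean(dim=(-2, -1)), out.mean(dim=(-2, -1)), atol=1e-5
)

print(same_without_kernel_flip, same_with_kernel_flip, pooled_equal)
```

Only when the kernel is flipped along the same axis as the input does the feature map come out as an exact mirror image, and only then does the pooled value stay the same.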