During backpropagation, will these cases have a different effect?
My main doubt is not about the numerical values but about the effect each of these would have.
The difference between options 1 and 2 is basically this: since sum produces a larger loss value than mean, the magnitude of the gradients from the sum operation will be bigger, but their direction will be the same.
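For a concrete picture of that scaling, here is the per-element gradient of the squared-error loss under each reduction (standard calculus, with N the number of elements):
∂/∂x_i Σ_j (x_j − t_j)² = 2 (x_i − t_i) for sum
∂/∂x_i (1/N) Σ_j (x_j − t_j)² = (2/N) (x_i − t_i) for mean
So the mean gradient is exactly the sum gradient divided by N, which is the factor you'll see in the demonstration below.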
Here's a little demonstration. Let's first declare the necessary variables:
import torch

x = torch.tensor([4,1,3,7],dtype=torch.float32,requires_grad=True)
target = torch.tensor([4,2,5,4],dtype=torch.float32)
Now let's compute the gradient for x using an L2 loss with sum:
loss = ((x-target)**2).sum()
loss.backward()
print(x.grad)
This outputs: tensor([ 0., -2., -4., 6.])
Now using mean (after resetting x.grad):
x.grad = None  # reset the gradient left over from the previous backward pass
loss = ((x-target)**2).mean()
loss.backward()
print(x.grad)
And this outputs: tensor([ 0.0000, -0.5000, -1.0000, 1.5000])
Notice how the latter gradients are exactly 1/4th of those from sum; that's because the tensors here contain 4 elements.
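If you want to verify that relationship programmatically, here is a minimal self-contained sketch (grad_sum and grad_mean are names I'm introducing just for this check):
import torch

x = torch.tensor([4,1,3,7],dtype=torch.float32,requires_grad=True)
target = torch.tensor([4,2,5,4],dtype=torch.float32)

((x - target) ** 2).sum().backward()
grad_sum = x.grad.clone()   # gradient from the sum-reduced loss
x.grad = None               # reset before the second backward pass

((x - target) ** 2).mean().backward()
grad_mean = x.grad.clone()  # gradient from the mean-reduced loss

# the mean gradient should equal the sum gradient divided by the number of elements
print(torch.allclose(grad_mean, grad_sum / x.numel()))  # True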
About the third option, if I understand you correctly, that's not possible. You cannot backpropagate before aggregating the individual pixel errors into a scalar, using sum, mean, or anything else.
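For completeness, here is a small sketch of what happens if you try to call backward() directly on the per-pixel errors without reducing them to a scalar first; PyTorch refuses because it cannot implicitly create a gradient for a non-scalar output:
import torch

x = torch.tensor([4,1,3,7],dtype=torch.float32,requires_grad=True)
target = torch.tensor([4,2,5,4],dtype=torch.float32)

per_element_loss = (x - target) ** 2   # still a 4-element tensor, not a scalar
try:
    per_element_loss.backward()        # no reduction applied before backward
except RuntimeError as e:
    print(e)  # "grad can be implicitly created only for scalar outputs"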