Suppose I have a neural network whose weights are initialized from a normal distribution, and I want the mean used for that initialization to be a trainable parameter of the network.
I have a small example:
import torch
parameter_vector = torch.tensor(range(10), dtype=torch.float, requires_grad=True)
sigma = torch.ones(parameter_vector.size(0), dtype=torch.float)*0.1
init_result = torch.normal(parameter_vector, sigma)
print('requires_grad:', init_result.requires_grad)
print('result: ', init_result)
This results in:
requires_grad: True
result: tensor([ 0.1026, 0.9183, 1.9586, 3.1778, 4.0538, 4.8056, 5.9561,
6.9501, 7.7653, 8.9583])
So the requires_grad flag was evidently inherited from the mean tensor, i.e. parameter_vector.
But does this automatically mean that parameter_vector will be updated through backward() in a larger network where init_result affects the end result?
Especially since normal() does not really seem like an ordinary operation, because it involves randomness.
Thanks to @iacolippo (see the comments below the question), the problem is now solved. I am adding the code I ended up using, in case it helps anyone else.
As presumed in the question and confirmed by @iacolippo, the code posted in the question does not backpropagate to parameter_vector:
import torch
parameter_vector = torch.tensor(range(5), dtype=torch.float, requires_grad=True)
print('- initial parameter weights:', parameter_vector)
sigma = torch.ones(parameter_vector.size(0), dtype=torch.float)*0.1
init_result = torch.normal(parameter_vector, sigma)
print('- normal init result requires_grad:', init_result.requires_grad)
print('- normal init vector', init_result)
#print('result: ', init_result)
sum_result = init_result.sum()
sum_result.backward()
print('- summed dummy-loss:', sum_result)
optimizer = torch.optim.SGD([parameter_vector], lr = 0.01, momentum=0.9)
optimizer.step()
print()
print('- parameter weights after update:', parameter_vector)
Out:
- initial parameter weights: tensor([0., 1., 2., 3., 4.], requires_grad=True)
- normal init result requires_grad: True
- normal init vector tensor([-0.0909, 1.1136, 2.1143, 2.8838, 3.9340], grad_fn=<NormalBackward3>)
- summed dummy-loss: tensor(9.9548, grad_fn=<SumBackward0>)
- parameter weights after update: tensor([0., 1., 2., 3., 4.], requires_grad=True)
As you can see, calling backward() does not raise an error (see the linked issue in the comments above), but the parameters are not updated by the SGD step either.
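To make the missing update visible, one can inspect the gradient that arrives at the mean tensor. The following is a minimal sketch, not part of the original code: the names mu, sample and no_signal are my own, and I am assuming (based on the PyTorch versions I know of) that torch.normal either reports a zero gradient for its mean or leaves the gradient unset, so the check accepts both cases.

```python
import torch

# Hypothetical names; mu plays the role of parameter_vector from the post.
mu = torch.arange(5, dtype=torch.float).requires_grad_()
sigma = 0.1 * torch.ones(5)

# Sampling directly with the trainable mean, as in the question.
sample = torch.normal(mu, sigma)

# Guard: only call backward() if the sample is attached to the graph.
if sample.requires_grad:
    sample.sum().backward()

# Assumption: the gradient w.r.t. mu is either missing or all zeros,
# which is why an optimizer step leaves the parameters unchanged.
grad = mu.grad
no_signal = grad is None or bool(torch.all(grad == 0))
print("gradient carries no signal:", no_signal)
```

If no_signal is True, the optimizer has nothing to work with, which matches the unchanged weights in the output above.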
One solution is to use the formula (the reparameterization trick) given here: https://stats.stackexchange.com/a/342815/133099
x = μ + σ · ε, where ε ~ N(0, 1)
To achieve this:
sigma = torch.ones(parameter_vector.size(0), dtype=torch.float)*0.1
init_result = torch.normal(parameter_vector, sigma)
Changes to:
dim = parameter_vector.size(0)
sigma = 0.1
init_result = parameter_vector + sigma*torch.normal(torch.zeros(dim), torch.ones(dim))
After changing these lines, the code backpropagates as intended: the parameter vector is updated after calling backward() and the SGD step.
Output with changed lines:
- initial parameter weights: tensor([0., 1., 2., 3., 4.], requires_grad=True)
- normal init result requires_grad: True
- normal init vector tensor([-0.1802, 0.9261, 1.9482, 3.0817, 3.9773], grad_fn=<ThAddBackward>)
- summed dummy-loss: tensor(9.7532, grad_fn=<SumBackward0>)
- parameter weights after update: tensor([-0.0100, 0.9900, 1.9900, 2.9900, 3.9900], requires_grad=True)
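The update of exactly -0.01 per entry follows from the formula: since x = μ + σ · ε, the derivative of sum(x) with respect to each μ_i is 1, and SGD with lr = 0.01 subtracts 0.01 · 1. A small sketch that checks this gradient directly (the names mu, eps and grad_mu are my own, and torch.randn is used here as an equivalent way to draw from N(0, 1)):

```python
import torch

# Hypothetical names; mu plays the role of parameter_vector from the post.
mu = torch.arange(5, dtype=torch.float).requires_grad_()
sigma = 0.1

# Noise is drawn independently of mu, so it is a constant for autograd.
eps = torch.randn(mu.size(0))

# Reparameterized sample: differentiable with respect to mu.
sample = mu + sigma * eps
sample.sum().backward()

# d(sum(mu + sigma * eps)) / d(mu_i) = 1 for every i.
grad_mu = mu.grad.clone()
print(grad_mu)
```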
Another way is to use torch.distributions (see the PyTorch documentation).
To do so, replace the respective lines in the code above with:
i = torch.ones(parameter_vector.size(0))
sigma = 0.1
m = torch.distributions.Normal(parameter_vector, sigma*i)
init_result = m.rsample()
Output with changed lines:
- initial parameter weights: tensor([0., 1., 2., 3., 4.], requires_grad=True)
- normal init result requires_grad: True
- normal init vector tensor([-0.0767, 0.9971, 2.0448, 2.9408, 4.1321], grad_fn=<ThAddBackward>)
- summed dummy-loss: tensor(10.0381, grad_fn=<SumBackward0>)
- parameter weights after update: tensor([-0.0100, 0.9900, 1.9900, 2.9900, 3.9900], requires_grad=True)
As the output above shows, using torch.distributions with rsample() also yields code that backpropagates: the parameter vector is updated after calling backward() and the SGD step.
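Note that it matters which sampling method is called. As far as I know, Normal.sample() draws without tracking gradients, whereas Normal.rsample() uses the reparameterized form. A short sketch comparing the two (the variable names are my own, for illustration only):

```python
import torch
from torch.distributions import Normal

# Hypothetical names; mu plays the role of parameter_vector from the post.
mu = torch.arange(5, dtype=torch.float).requires_grad_()
dist = Normal(mu, 0.1 * torch.ones(5))

# sample() is detached from the graph; rsample() stays attached.
detached = dist.sample()
reparam = dist.rsample()
print("sample() requires_grad: ", detached.requires_grad)
print("rsample() requires_grad:", reparam.requires_grad)

# Only the rsample() result can pass gradients back to mu.
reparam.sum().backward()
grad_mu = mu.grad.clone()
print(grad_mu)
```

So when the mean should be trained, rsample() is the call to use; sample() would silently cut the graph.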
I hope this is helpful for someone.