
PyTorch - Effect of normal() initialization on gradients


Suppose I have a neural network whose values are initialized from a normal distribution, and I want the mean used for that initialization to be a trainable parameter of the network.

I have a small example:

import torch
parameter_vector = torch.tensor(range(10), dtype=torch.float, requires_grad=True)
sigma = torch.ones(parameter_vector.size(0), dtype=torch.float)*0.1
init_result = torch.normal(parameter_vector, sigma)
print('requires_grad:', init_result.requires_grad)
print('result:       ', init_result)

This results in:

requires_grad: True
result:        tensor([ 0.1026,  0.9183,  1.9586,  3.1778,  4.0538,  4.8056,  5.9561,
         6.9501,  7.7653,  8.9583])

So the requires_grad flag was evidently inherited from the mean tensor, i.e. parameter_vector.

But does this automatically mean that parameter_vector will receive gradients through backward(), and so be updated by the optimizer, in a larger network where init_result affects the end result?

I am asking especially because normal() does not seem like an ordinary differentiable operation, since it involves randomness.


Solution

  • Thanks to @iacolippo (see the comments below the question), the problem is now solved. I want to supplement this by posting the code I am using now, in case it helps anyone else.

    As suspected in the question and confirmed by @iacolippo, the code posted in the question does not backpropagate to parameter_vector:

    import torch
    parameter_vector = torch.tensor(range(5), dtype=torch.float, requires_grad=True)
    print('- initial parameter weights:', parameter_vector)
    sigma = torch.ones(parameter_vector.size(0), dtype=torch.float)*0.1
    init_result = torch.normal(parameter_vector, sigma)
    print('- normal init result requires_grad:', init_result.requires_grad)
    print('- normal init vector', init_result)
    sum_result = init_result.sum()
    sum_result.backward()
    print('- summed dummy-loss:', sum_result)
    optimizer = torch.optim.SGD([parameter_vector], lr = 0.01, momentum=0.9)
    optimizer.step()
    print()
    print('- parameter weights after update:', parameter_vector)
    

    Out:

    - initial parameter weights: tensor([0., 1., 2., 3., 4.], requires_grad=True)
    - normal init result requires_grad: True
    - normal init vector tensor([-0.0909,  1.1136,  2.1143,  2.8838,  3.9340], grad_fn=<NormalBackward3>)
    - summed dummy-loss: tensor(9.9548, grad_fn=<SumBackward0>)
    
    - parameter weights after update: tensor([0., 1., 2., 3., 4.], requires_grad=True)
    

    As you can see, calling backward() does not raise an error (see the linked issue in the comments above), but the SGD step does not update the parameters either.
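One way to see why nothing changes is to inspect parameter_vector.grad after backward(). The following is a minimal sketch, assuming a reasonably recent PyTorch release in which torch.normal still accepts a mean tensor that requires grad, as in the run above. The helper name grad_through_normal and the fixed seed are hypothetical and only there for illustration. The gradient is printed rather than hard-coded, because what torch.normal records for its mean input may differ between versions.

```python
import torch

def grad_through_normal(n=5, std=0.1, lr=0.01):
    """Sample with torch.normal, run backward() and one SGD step.

    Returns the mean before the step, its gradient, and the mean after the step.
    """
    torch.manual_seed(0)  # assumption: fixed seed, only for repeatability
    mean = torch.arange(n, dtype=torch.float, requires_grad=True)
    sigma = torch.full((n,), std)

    sample = torch.normal(mean, sigma)  # the non-reparameterized draw from the question
    sample.sum().backward()             # dummy loss, as in the answer's code

    before = mean.detach().clone()
    optimizer = torch.optim.SGD([mean], lr=lr, momentum=0.9)
    optimizer.step()
    return before, mean.grad, mean.detach().clone()

before, grad, after = grad_through_normal()
print('grad w.r.t. mean:      ', grad)
print('changed by SGD step:   ', not torch.equal(before, after))
```

If the printed gradient is all zeros (or None), the optimizer has nothing to apply. That would be consistent with the unchanged parameters in the output above.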


    Working Example 1

    One solution is to use the formula/trick given here: https://stats.stackexchange.com/a/342815/133099

    x = μ + σ · sample(N(0, 1))
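The reason this trick helps is that the randomness is moved into a noise term that does not depend on the parameters. A short derivation is sketched below. The symbols ε for the standard normal draw and L for a scalar loss are assumptions for illustration and do not appear in the original post.

```latex
\begin{align}
x &= \mu + \sigma\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, 1) \\
\frac{\partial x}{\partial \mu} &= 1, \qquad
\frac{\partial x}{\partial \sigma} = \varepsilon \\
\frac{\partial L}{\partial \mu} &= \frac{\partial L}{\partial x}\,\frac{\partial x}{\partial \mu}
  = \frac{\partial L}{\partial x}
\end{align}
```

For the dummy loss used in this answer, L is the sum of all entries of x, so the gradient with respect to each entry of μ is 1.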

    To achieve this, the original sampling code:

    sigma = torch.ones(parameter_vector.size(0), dtype=torch.float)*0.1
    init_result = torch.normal(parameter_vector, sigma)
    

    Changes to:

    dim = parameter_vector.size(0)
    sigma = 0.1
    init_result = parameter_vector + sigma*torch.normal(torch.zeros(dim), torch.ones(dim))
    

    After changing these lines, the code becomes backpropagable: the parameter vector is updated after calling backward() and taking an SGD step.

    Output with changed lines:

    - initial parameter weights: tensor([0., 1., 2., 3., 4.], requires_grad=True)
    - normal init result requires_grad: True
    - normal init vector tensor([-0.1802,  0.9261,  1.9482,  3.0817,  3.9773], grad_fn=<ThAddBackward>)
    - summed dummy-loss: tensor(9.7532, grad_fn=<SumBackward0>)
    
    - parameter weights after update: tensor([-0.0100,  0.9900,  1.9900,  2.9900,  3.9900], requires_grad=True)
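Put together as one script, Working Example 1 might look like the sketch below. It uses torch.randn for the standard normal draw, which samples from the same distribution as torch.normal(torch.zeros(dim), torch.ones(dim)) above. The function name reparameterized_step and the fixed seed are assumptions made for illustration.

```python
import torch

def reparameterized_step(n=5, sigma=0.1, lr=0.01):
    """Draw x = mean + sigma * eps, backpropagate a dummy loss, take one SGD step."""
    torch.manual_seed(0)  # assumption: fixed seed, only for repeatability
    mean = torch.arange(n, dtype=torch.float, requires_grad=True)

    eps = torch.randn(n)         # noise drawn independently of the parameters
    sample = mean + sigma * eps  # differentiable with respect to mean

    sample.sum().backward()      # dummy loss: sum of all sampled values
    grad = mean.grad.clone()

    optimizer = torch.optim.SGD([mean], lr=lr, momentum=0.9)
    optimizer.step()
    return grad, mean.detach().clone()

grad, updated = reparameterized_step()
print('grad w.r.t. mean:', grad)
print('updated mean:    ', updated)
```

Because the dummy loss is a plain sum, the gradient for every entry of the mean is 1, so the first SGD step moves each entry by exactly the learning rate. This matches the shift of 0.01 in the output above.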
    

    Working Example 2

    Another way is to use torch.distributions (see the PyTorch documentation).

    To do so, replace the respective lines in the code above with:

    i = torch.ones(parameter_vector.size(0))
    sigma = 0.1
    m = torch.distributions.Normal(parameter_vector, sigma*i)
    init_result = m.rsample()
    

    Output with changed lines:

    - initial parameter weights: tensor([0., 1., 2., 3., 4.], requires_grad=True)
    - normal init result requires_grad: True
    - normal init vector tensor([-0.0767,  0.9971,  2.0448,  2.9408,  4.1321], grad_fn=<ThAddBackward>)
    - summed dummy-loss: tensor(10.0381, grad_fn=<SumBackward0>)
    
    - parameter weights after update: tensor([-0.0100,  0.9900,  1.9900,  2.9900,  3.9900], requires_grad=True)
    

    As the output above shows, using torch.distributions with rsample() also yields backpropagable code in which the parameter vector is updated after calling backward() and taking an SGD step.
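The choice of rsample() over sample() matters here. The sketch below contrasts the two on the same distribution, with a fixed seed added only for repeatability. It shows that sample() returns a tensor cut off from the graph, while rsample() keeps the path back to the mean.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)  # assumption: fixed seed, only for repeatability
mean = torch.arange(5, dtype=torch.float, requires_grad=True)
dist = Normal(mean, 0.1 * torch.ones(5))

detached = dist.sample()   # drawn without gradient tracking
pathwise = dist.rsample()  # reparameterized draw: loc + eps * scale

print('sample().requires_grad: ', detached.requires_grad)
print('rsample().requires_grad:', pathwise.requires_grad)

pathwise.sum().backward()  # dummy loss, as in the examples above
print('grad w.r.t. mean:', mean.grad)
```

Using sample() in place of rsample() in Working Example 2 would therefore leave parameter_vector without a usable gradient, much like the original torch.normal call.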

    I hope this is helpful for someone.