
How to properly update the weights in PyTorch?


I'm trying to implement gradient descent in PyTorch following the scheme in the tutorial linked below, but I can't figure out how to update the weights properly. It is just a toy example: two linear layers, with 2 nodes in the hidden layer and one output.

Learning rate = 0.05;
target output = 1

https://hmkcode.github.io/ai/backpropagation-step-by-step/

[Figure: forward pass]

[Figure: backward pass]

My code is as follows:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim

    class MyNet(nn.Module):

        def __init__(self):
            super().__init__()
            # Two bias-free linear layers, initialised with the tutorial's weights.
            self.linear1 = nn.Linear(2, 2, bias=False)
            self.linear1.weight = nn.Parameter(torch.tensor([[0.11, 0.21], [0.12, 0.08]]))
            self.linear2 = nn.Linear(2, 1, bias=False)
            self.linear2.weight = nn.Parameter(torch.tensor([[0.14, 0.15]]))

        def forward(self, inputs):
            out = self.linear1(inputs)
            out = self.linear2(out)
            return out

    loss_function = nn.L1Loss()
    model = MyNet()
    optimizer = optim.SGD(model.parameters(), lr=0.05)
    input = torch.tensor([2.0, 3.0])
    print('weights before backpropagation = ', list(model.parameters()))
    for epoch in range(1):
        result = model(input)
        loss = loss_function(result, torch.tensor([1.00], dtype=torch.float))
        print('result = ', result)
        print("loss = ", loss)
        model.zero_grad()   # clear any stale gradients before the backward pass
        loss.backward()     # compute d(loss)/d(weight) for every parameter
        print('gradients =', [x.grad for x in model.parameters()])
        optimizer.step()    # SGD update: weight -= lr * gradient
        print('weights after backpropagation = ', list(model.parameters()))

The result is as follows:

    weights before backpropagation =  [Parameter containing:
    tensor([[0.1100, 0.2100],
            [0.1200, 0.0800]], requires_grad=True), Parameter containing:
    tensor([[0.1400, 0.1500]], requires_grad=True)]

    result =  tensor([0.1910], grad_fn=<SqueezeBackward3>)
    loss =  tensor(0.8090, grad_fn=<L1LossBackward>)

    gradients = [tensor([[-0.2800, -0.4200], [-0.3000, -0.4500]]), 
                 tensor([[-0.8500, -0.4800]])]

    weights after backpropagation =  [Parameter containing:
    tensor([[0.1240, 0.2310],
            [0.1350, 0.1025]], requires_grad=True), Parameter containing:
    tensor([[0.1825, 0.1740]], requires_grad=True)]

Forward pass values:

2*0.11 + 3*0.21 = 0.85 ->  
2*0.12 + 3*0.08 = 0.48 -> 0.85*0.14 + 0.48*0.15 = 0.191 -> loss = 0.191 - 1 = -0.809  

Backward pass: let's calculate w5 and w6 (output node weights)

w = w - (prediction - target) * (gradient) * (output of previous node) * (learning rate)  
w5 = 0.14 - (0.191 - 1)*1*0.85*0.05 = 0.14 + 0.034 = 0.174  
w6 = 0.15 - (0.191 - 1)*1*0.48*0.05 = 0.15 + 0.019 = 0.169

In my example, PyTorch doesn't multiply the derivative by the loss, so we get the wrong weights after the update. For the output node we got new weights w5, w6 = [0.1825, 0.1740], when they should be [0.174, 0.169].

Moving backward to update the first weight of the output node (w5), we need to calculate (prediction - target) * (gradient) * (output of previous node) * (learning rate) = -0.809*1*0.85*0.05 = -0.034. The updated weight is w5 = 0.14 - (-0.034) = 0.174. But PyTorch instead calculated a new weight of 0.1825. It forgot to multiply by (prediction - target) = -0.809. For the output node we got gradients -0.8500 and -0.4800, but we still need to multiply them by the loss 0.809 and the learning rate 0.05 before we can update the weights.

What is the proper way of doing this? Should we pass 'loss' as an argument to backward(), as follows: loss.backward(loss)?

That seems to fix it, but I couldn't find any example of this in the documentation.


Solution

  • You should call .zero_grad() on the optimizer, i.e. optimizer.zero_grad(), rather than on the loss or the model as suggested in the comments (calling it on the model works, but it is less clear and readable IMO).

    Apart from that, your parameters are updated correctly, so the error is not on PyTorch's side.

    Based on gradient values you provided:

    gradients = [tensor([[-0.2800, -0.4200], [-0.3000, -0.4500]]), 
                 tensor([[-0.8500, -0.4800]])]
    

    Let's multiply all of them by your learning rate (0.05):

    gradients_times_lr = [tensor([[-0.014, -0.021], [-0.015, -0.0225]]), 
                          tensor([[-0.0425, -0.024]])]
    

    Finally, let's apply ordinary SGD (theta -= gradient * lr), to get exactly the same results as in PyTorch:

    parameters = [tensor([[0.1240, 0.2310], [0.1350, 0.1025]]),
                  tensor([[0.1825, 0.1740]])]
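The same arithmetic can be reproduced without PyTorch. The sketch below is only an illustration: it hard-codes the initial weights and the gradients printed in the question, and `sgd_step` is a hypothetical helper name, not a PyTorch function.

```python
# Initial weights and gradients copied from the question's printed output.
lr = 0.05
weights = [[[0.11, 0.21], [0.12, 0.08]], [[0.14, 0.15]]]
grads = [[[-0.28, -0.42], [-0.30, -0.45]], [[-0.85, -0.48]]]


def sgd_step(weights, grads, lr):
    """Plain SGD, applied element-wise: theta <- theta - lr * grad."""
    return [
        [[w - lr * g for w, g in zip(w_row, g_row)]
         for w_row, g_row in zip(w_mat, g_mat)]
        for w_mat, g_mat in zip(weights, grads)
    ]


updated = sgd_step(weights, grads, lr)
for mat in updated:
    # Round only for display, to match the four decimals PyTorch prints.
    print([[round(v, 4) for v in row] for row in mat])
```

Up to floating-point rounding, the printed values match the "weights after backpropagation" in the question, which suggests that `optimizer.step()` did exactly what plain SGD is supposed to do.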
    

    What you have done is scale the update by (prediction - target) on top of the gradient PyTorch computed, and that's not how it works for this loss!

    What you've done:

    w5= 0.14 -(0.191-1)*1*0.85*0.05= 0.14 + 0.034= 0.174  
    

    What should have been done (using PyTorch's results):

    w5 = 0.14 - (-0.85*0.05) = 0.1825
    

    There is no need to multiply by the output of the previous node yourself: that happens behind the scenes. That is what loss.backward() does: it calculates the correct gradients for all of the parameters.

    If you want to calculate them manually, you have to start at the loss (with delta being one) and backpropagate all the way down (do not use the learning rate here, it's a different story!).

    After all of them are calculated, you can multiply each gradient by the optimizer's learning rate (or apply any other update rule, e.g. momentum) and subtract the result from the corresponding weight; that gives you the correct update.
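As a rough illustration of the manual route, here is a pure-Python sketch of backprop for the question's network (bias-free, no activations, L1 loss). The variable names (`d_out`, `d_h`, `grad_W1`, `grad_W2`) are my own and not part of any PyTorch API. The learning rate does not appear anywhere in it.

```python
# Hand-written backprop for the question's bias-free 2-2-1 linear network.
x = [2.0, 3.0]
W1 = [[0.11, 0.21], [0.12, 0.08]]  # hidden-layer weights, one row per hidden node
W2 = [0.14, 0.15]                  # output weights w5, w6
target = 1.0

# Forward pass.
h = [sum(w * xi for w, xi in zip(row, x)) for row in W1]  # about [0.85, 0.48]
out = sum(w * hi for w, hi in zip(W2, h))                 # about 0.191

# Backward pass: start with d(loss)/d(loss) = 1 and apply the chain rule.
d_loss = 1.0
sign = (out > target) - (out < target)  # derivative of |out - target| w.r.t. out
d_out = d_loss * sign                   # -1 here, because out < target
grad_W2 = [d_out * hi for hi in h]      # d(loss)/d(w5), d(loss)/d(w6)
d_h = [d_out * w for w in W2]           # gradient flowing into the hidden nodes
grad_W1 = [[dh * xi for xi in x] for dh in d_h]

print(grad_W1)
print(grad_W2)
```

Up to floating-point rounding, these are the gradients PyTorch printed in the question: [[-0.28, -0.42], [-0.30, -0.45]] and [-0.85, -0.48].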

    How to calculate backprop

    The learning rate is not part of backpropagation; leave it alone until you have calculated all of the gradients (otherwise you conflate two separate algorithms: the optimization procedure and backpropagation).

    1. Derivative of total error w.r.t. output

    I don't know why you are using Mean Absolute Error (the tutorial uses a squared error), and that's why the results differ. But let's go with your choice.

    The derivative of | y_true - y_pred | w.r.t. y_pred is sign(y_pred - y_true), which is -1 here because the prediction is below the target. Its magnitude is 1, so IT IS NOT the same as the loss. Switch to a squared error to get the tutorial's results: for (y_pred - y_true)^2 the derivative is 2 * (y_pred - y_true), and we usually multiply the squared error by 1/2 to cancel that factor of two, leaving (y_pred - y_true). Note that nn.MSELoss does not include the 1/2, so scale it by 0.5 if you want to match.

    In the squared-error case you would multiply by the error (y_pred - y_true), but that depends entirely on the loss function (it was a bit unfortunate that the tutorial you were using didn't point this out).
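To make the dependence on the loss function concrete, here is a small pure-Python comparison of one SGD step on w5 and w6 under the two losses. It is a sketch under two assumptions: the squared error carries the factor of 1/2 (as the question's hand formula implies), and the hidden outputs are taken from the question's forward pass. `step` is a hypothetical helper.

```python
lr = 0.05
h = [0.85, 0.48]   # hidden-layer outputs from the forward pass
w = [0.14, 0.15]   # w5, w6
target = 1.0
pred = sum(wi * hi for wi, hi in zip(w, h))  # about 0.191


def step(d_loss_d_pred):
    """One SGD step on w5, w6, given d(loss)/d(pred)."""
    return [wi - lr * d_loss_d_pred * hi for wi, hi in zip(w, h)]


# L1 loss |pred - target|: the derivative is just the sign of the error.
l1_update = step(-1.0 if pred < target else 1.0)

# Assumed loss 0.5 * (pred - target)**2: the derivative is the error itself.
half_sq_update = step(pred - target)

print([round(v, 4) for v in l1_update])
print([round(v, 4) for v in half_sq_update])
```

The first line reproduces PyTorch's [0.1825, 0.174]; the second comes out at roughly [0.174, 0.169], the values the hand calculation in the question expected.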

    2. Derivative of total error w.r.t. w5

    You could probably go from here, but... The derivative of the output w.r.t. w5 is the output of h1 (0.85 in this case). We multiply it by the derivative of the total error w.r.t. the output (it is -1, not the loss value!) and obtain -0.85, as PyTorch does. The same idea goes for w6.
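Keeping track of the sign, the chain rule for the two output weights can be written out as follows. The symbols are my own notation, not the tutorial's: $\hat{y} = w_5 h_1 + w_6 h_2$ is the prediction, $y$ the target, and $L = |\hat{y} - y|$ the L1 loss.

```latex
\frac{\partial L}{\partial w_5}
  = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_5}
  = \operatorname{sign}(\hat{y} - y) \cdot h_1
  = (-1)(0.85) = -0.85,
\qquad
\frac{\partial L}{\partial w_6}
  = \operatorname{sign}(\hat{y} - y) \cdot h_2
  = (-1)(0.48) = -0.48
```

These are the two entries of the second gradient tensor printed in the question.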

    I seriously advise you not to confuse the learning rate with backprop: you are making your life harder (backprop is not easy, and quite counterintuitive IMO), and those are two separate things (can't stress that enough).

    This source is nice, more step-by-step, with a little more complicated network idea (activations included), so you can get a better grasp if you go through all of it.

    Furthermore, if you are really keen (and you seem to be) to know more of the ins and outs of this, calculate the weight corrections for other optimizers (say, Nesterov momentum), so you see why we should keep those ideas separate.
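As a starting point for that exercise, here is a hedged pure-Python sketch that applies three update rules to w5 using the same gradient. The rules follow my understanding of how torch.optim.SGD handles `momentum` and `nesterov` (with zero dampening); treat that as an assumption and check the documentation. `sgd_update` and `run` are hypothetical helpers, and holding the gradient fixed at -0.85 for two steps is purely for illustration.

```python
def sgd_update(w, grad, lr, buf=None, momentum=0.0, nesterov=False):
    """Return (new_weight, new_buffer) for one scalar SGD step."""
    if momentum == 0.0:
        return w - lr * grad, None  # plain SGD
    # Momentum buffer: a running sum of past gradients, decayed by `momentum`.
    buf = grad if buf is None else momentum * buf + grad
    # Nesterov looks ahead along the buffer; classical momentum uses it directly.
    step = grad + momentum * buf if nesterov else buf
    return w - lr * step, buf


def run(steps, **kwargs):
    w, buf = 0.14, None  # w5 from the question
    for _ in range(steps):
        # Assumption for illustration only: the gradient stays at -0.85.
        w, buf = sgd_update(w, -0.85, lr=0.05, buf=buf, **kwargs)
    return w


plain = run(2)
classical = run(2, momentum=0.9)
nesterov = run(2, momentum=0.9, nesterov=True)
print(plain, classical, nesterov)
```

The gradient fed to each rule is identical; only the way it is turned into a weight change differs. That is the reason to compute gradients first and leave the learning rate and the update formula to the optimizer.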