python · deep-learning · pytorch · neural-network

Exactly where in the training loop should zero_grad() be used? Does calling it before or after calculating the loss change things?


So I know you need to zero out the gradients before a backward pass, and the reason for that part is clear. What confuses me is where exactly to put zero_grad(): I've seen examples place it either at the start of the loop or just before loss.backward(), and I can't tell which is correct, or whether there's much of a difference at all.

I did try both placements and noticed a change in all of my accuracy numbers, which has me curious about what the reason might be and whether it's significant at all.


Solution

  • So, both of the placements you describe come before backward(), which means it doesn't matter which one you use. In fact, you can call zero_grad() before backward() and step() or after them, as long as it is not between backward() and step(), since that would clear the gradients before the optimizer gets to apply them. Beyond that, it really comes down to the coding style of whoever wrote the code.
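
    As a minimal sketch (model, opt, loss_fn, and loader here are generic placeholders, not names from your code), the spots where zero_grad() can and cannot go look like this:

    for inputs, targets in loader:
        opt.zero_grad()              # fine here (the most common spot)
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        # opt.zero_grad()            # also fine: still before backward()
        loss.backward()
        # opt.zero_grad()            # not here: this would wipe the gradients
        #                            # before step() can apply them
        opt.step()
        # opt.zero_grad()            # fine again: after step(), before the
        #                            # next backward()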

    Normally, if you have only one optimizer, people tend to put opt.zero_grad() at the beginning of the loop. But if your model is a bit more complex, with more than one optimizer, then I tend to do something like:

    encoder_opt.zero_grad()      # clear only the encoder's gradients
    encoder_loss = ...
    encoder_loss.backward()
    encoder_opt.step()
    
    decoder_opt.zero_grad()      # clear only the decoder's gradients
    decoder_loss = ...
    decoder_loss.backward()
    decoder_opt.step()
    
    

    to keep each optimizer's zero_grad(), backward(), and step() grouped together, which I find easier to read.

    About your experiment: I don't think moving that one line produces different results by itself. It's more likely that your model's weights are initialized, and therefore updated, differently each time you retrain it. You can set a specific seed for reproducibility.
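
    For instance, a minimal sketch of seeding (which of these calls actually matters depends on where your randomness comes from):

    import random
    import numpy as np
    import torch
    
    seed = 42                         # any fixed value
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy RNG (e.g. data shuffling)
    torch.manual_seed(seed)           # PyTorch CPU RNG (weight init, dropout)
    torch.cuda.manual_seed_all(seed)  # PyTorch GPU RNGs, if training on CUDA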