Here they mention the need to include optim.zero_grad() when training to zero the parameter gradients. My question is: could I call net.zero_grad() instead, and would that have the same effect? Or is it necessary to call optim.zero_grad()? Moreover, what happens if I do both? If I do none, then the gradients get accumulated, but what does that exactly mean? Do they get added? In other words, what's the difference between optim.zero_grad() and net.zero_grad()?
I am asking because here, line 115, they use net.zero_grad(), and it is the first time I have seen that. It is an implementation of a reinforcement learning algorithm, where one has to be especially careful with the gradients because there are multiple networks and gradients, so I suppose there is a reason for them to call net.zero_grad() as opposed to optim.zero_grad().
net.zero_grad() sets the gradients of all its parameters (including parameters of submodules) to zero. Calling optim.zero_grad() does the same, but for all parameters that were passed to the optimiser. If the optimiser only holds net.parameters(), e.g. optim = Adam(net.parameters(), lr=1e-3), then both calls are equivalent, since they cover exactly the same parameters.
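A minimal sketch of that equivalence, assuming a small made-up network and random input (the layer sizes and the grads_cleared helper are illustrative, not from the question). Depending on the PyTorch version, a cleared gradient is either a tensor of zeros or None, so the helper accepts both:

```python
import torch
from torch import nn
from torch.optim import Adam


def grads_cleared(params):
    # A cleared gradient is either None (default in recent PyTorch
    # releases) or a tensor of zeros (older releases).
    return all(p.grad is None or bool((p.grad == 0).all()) for p in params)


# Hypothetical toy network; the optimiser holds exactly net.parameters().
net = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 1))
optim = Adam(net.parameters(), lr=1e-3)
x = torch.randn(8, 4)

# Populate the gradients, then clear them through the module.
net(x).sum().backward()
has_grads_before = all(p.grad is not None for p in net.parameters())
net.zero_grad()
cleared_by_net = grads_cleared(net.parameters())

# Populate them again, then clear them through the optimiser.
net(x).sum().backward()
optim.zero_grad()
cleared_by_optim = grads_cleared(net.parameters())
```

Both cleared_by_net and cleared_by_optim should come out True here, because the module and the optimiser refer to the same set of parameters.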
You could also have other parameters that are optimised by the same optimiser but are not part of net. In that case you would either have to set their gradients to zero manually, and therefore keep track of all those parameters, or you can simply call optim.zero_grad() to ensure that every parameter being optimised has its gradient set to zero.
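To illustrate the difference, here is a small sketch in which the optimiser also holds a parameter that lives outside net. The name extra_scale and the fixed weights are assumptions made up for this example so that the gradient is known to be nonzero:

```python
import torch
from torch import nn
from torch.optim import Adam


def is_cleared(p):
    # Cleared means None (recent PyTorch) or all zeros (older PyTorch).
    return p.grad is None or bool((p.grad == 0).all())


net = nn.Linear(4, 1)
with torch.no_grad():
    # Fixed weights so the example is deterministic.
    net.weight.fill_(1.0)
    net.bias.fill_(0.0)

# Hypothetical extra parameter that is NOT registered on net.
extra_scale = nn.Parameter(torch.ones(1))
optim = Adam(list(net.parameters()) + [extra_scale], lr=1e-3)

x = torch.ones(8, 4)
loss = (extra_scale * net(x)).sum()
loss.backward()

# net.zero_grad() only reaches parameters registered on net.
net.zero_grad()
net_cleared = all(is_cleared(p) for p in net.parameters())
extra_cleared_by_net = is_cleared(extra_scale)

# optim.zero_grad() reaches everything the optimiser was given.
optim.zero_grad()
extra_cleared_by_optim = is_cleared(extra_scale)
```

After net.zero_grad() the gradient of extra_scale is still present, and only optim.zero_grad() clears it.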
Moreover, what happens if I do both?
Nothing, the gradients would just be set to zero again, but since they were already zero, it makes absolutely no difference.
If I do none, then the gradients get accumulated, but what does that exactly mean? do they get added?
Yes, they are added to the existing gradients. In the backward pass the gradient with respect to every parameter is calculated and then added to that parameter's stored gradient (param.grad). That allows you to run multiple backward passes that affect the same parameters, which would not be possible if the gradients were overwritten instead of added.
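The adding behaviour is easy to check on a single tensor. In this sketch (the tensor w and the factor 3 are arbitrary choices for illustration), two identical backward passes without zeroing in between double the stored gradient:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw of sum(3 * w) is 3 for every element.
(3 * w).sum().backward()
first = w.grad.clone()   # tensor([3., 3.])

# Second backward pass without zeroing: the new gradient is added
# to w.grad instead of replacing it.
(3 * w).sum().backward()
second = w.grad.clone()  # tensor([6., 6.])
```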
For example, you could accumulate the gradients over multiple batches if you need bigger batches for training stability but don't have enough memory to increase the batch size. This is trivial to achieve in PyTorch: you essentially leave out optim.zero_grad() and delay optim.step() until you have accumulated enough batches, as shown in HuggingFace - Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups.
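A rough sketch of such a loop, not taken from the linked article: the model, the random data, and accumulation_steps are made up for illustration. Dividing the loss by accumulation_steps is an assumption commonly made so that the accumulated gradient resembles the mean over one larger batch:

```python
import torch
from torch import nn
from torch.optim import SGD

torch.manual_seed(0)

# Hypothetical toy model and data, only to make the loop runnable.
net = nn.Linear(4, 1)
optim = SGD(net.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
batches = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(8)]

accumulation_steps = 4  # number of small batches per optimiser update
num_updates = 0

optim.zero_grad()
for i, (inputs, targets) in enumerate(batches, start=1):
    # Scale the loss so the summed gradients average over the batches.
    loss = loss_fn(net(inputs), targets) / accumulation_steps
    loss.backward()  # gradients are added to param.grad

    if i % accumulation_steps == 0:
        optim.step()       # update with the accumulated gradients
        optim.zero_grad()  # start the next accumulation window
        num_updates += 1
```

With 8 small batches and accumulation_steps = 4, the optimiser steps twice, each time using gradients gathered from 4 batches.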
That flexibility comes at the cost of having to manually set the gradients to zero. Frankly, one line is a very small cost to pay, even though many users won't make use of it and especially beginners might find it confusing.