python, deep-learning, pytorch, gradient-descent

Use of scheduler with self-adjusting optimizers in PyTorch


In PyTorch, the weight adjustment policy is determined by the optimizer, and the learning rate is adjusted with a scheduler. When the optimizer is SGD, there is only one learning rate and this is straightforward. When using Adagrad, Adam, or any similar optimizer which inherently adjusts the learning rate on a per-parameter basis, is there something in particular to look out for? Can I ignore the scheduler completely since the algorithm adjusts its own learning rates? Should I parameterize it very differently than if I'm using SGD?


Solution

  • The learning rate you define for optimizers like Adam is an upper bound. You can see this in Section 2.1 of the Adam paper; the stepsize α in the paper is the learning rate.

    The effective magnitude of the steps taken in parameter space at each timestep are approximately bounded by the stepsize setting α

    This stepsize α is also used directly: it is multiplied with the per-parameter step-size correction that the optimizer learns (see the update-rule sketch below). So reducing the learning rate reduces all of the individual per-parameter learning rates and lowers the upper bound. This can be helpful towards the end of training: the overall step sizes shrink, so only smaller steps occur, which may help the network settle into a minimum of the loss function.

    I have seen learning rate decay used with Adam in some papers, used it myself, and it did help. What I found is that you should decay more slowly than you would with SGD. With one model I simply multiply the learning rate by 0.8 every 10 epochs (see the scheduler example below). A gradual decay like this seems to work better than more drastic steps, since you don't "invalidate" the estimated moments too much. But this is just my theory.
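
To make the "α is an upper bound" point concrete, here is a minimal sketch of a single Adam-style update following Algorithm 1 of the paper. The function `adam_step` and its arguments are purely illustrative (this is not PyTorch's internal implementation); the defaults are the paper's suggested values.

```python
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, following Algorithm 1 of the Adam paper (illustrative only)."""
    m = beta1 * m + (1 - beta1) * grad          # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # biased second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    # lr (the paper's alpha) multiplies the learned per-parameter correction,
    # so scaling lr down scales every effective step down with it.
    param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v

# Toy usage: two steps on a single tensor of parameters with a fake gradient.
p, m, v = torch.zeros(3), torch.zeros(3), torch.zeros(3)
for t in range(1, 3):
    g = torch.ones(3)
    p, m, v = adam_step(p, g, m, v, t)
```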
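
To implement the "multiply by 0.8 every 10 epochs" schedule in PyTorch, `torch.optim.lr_scheduler.StepLR` can be combined with Adam. A minimal sketch follows; the model, data, and epoch count are made up just to keep it self-contained.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import StepLR

# Toy model and synthetic data, only so the sketch runs on its own.
model = nn.Linear(10, 1)
inputs = torch.randn(64, 10)
targets = torch.randn(64, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Scale Adam's (upper-bound) learning rate by 0.8 every 10 epochs.
scheduler = StepLR(optimizer, step_size=10, gamma=0.8)

for epoch in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()  # once per epoch; lr is multiplied by 0.8 at epochs 10, 20, ...
```

Printing `scheduler.get_last_lr()` after each epoch is an easy way to confirm the decay is being applied as expected.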