Tags: tensorflow, optimization, keras, adam

Is the Adam optimizer really RMSprop plus momentum? If so, why doesn't it have a momentum parameter?


Here is a link to the TensorFlow optimizers. There you can see that RMSprop takes momentum as an argument while Adam does not, so I am confused. Adam optimization is supposed to be RMSprop optimization with momentum, like this:

Adam = RMSprop + Momentum

But why, then, does RMSprop have a momentum parameter while Adam does not?
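
For reference, this is roughly how the two constructors differ in the Keras API (the argument values here are just illustrative, not recommendations):

    import tensorflow as tf

    # RMSprop exposes an explicit momentum argument ...
    rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, momentum=0.9)

    # ... while Adam has no momentum argument, only beta_1 and beta_2
    adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)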


Solution

  • Although the expression "Adam is RMSProp with momentum" is indeed widely used, it is only a very rough shorthand and should not be taken at face value; the original Adam paper already clarifies this explicitly (p. 6):

    There are a few important differences between RMSProp with momentum and Adam: RMSProp with momentum generates its parameter updates using a momentum on the rescaled gradient, whereas Adam updates are directly estimated using a running average of first and second moment of the gradient.

    Sometimes authors make clear that this expression is just a loose description, e.g. in the (highly recommended) Overview of gradient descent optimization algorithms (emphasis added):

    Adam also keeps an exponentially decaying average of past gradients m_t, similar to momentum.

    or in Stanford CS231n: CNNs for Visual Recognition (again, emphasis added):

    Adam is a recently proposed update that looks a bit like RMSProp with momentum.

    That said, it's true that some other frameworks do include a momentum parameter for Adam, but this is in fact the beta1 parameter; here is CNTK:

    momentum (float, list, output of momentum_schedule()) – momentum schedule. Note that this is the beta1 parameter in the Adam paper. For additional information, please refer to this CNTK Wiki article.

    So, don't take this too literally, and don't lose sleep over it. A small sketch contrasting the two update rules follows below.
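
    To make the quoted difference concrete, here is a minimal NumPy sketch of the two update rules for a single parameter vector (a simplified illustration following the usual textbook formulation, not the actual TensorFlow implementation):

        import numpy as np

        def rmsprop_momentum_step(w, g, state, lr=0.001, rho=0.9, momentum=0.9, eps=1e-7):
            # RMSProp with momentum: the momentum buffer accumulates the *rescaled* gradient
            state["v"] = rho * state["v"] + (1 - rho) * g ** 2            # running avg of squared gradients
            state["buf"] = momentum * state["buf"] + lr * g / (np.sqrt(state["v"]) + eps)
            return w - state["buf"]

        def adam_step(w, g, state, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
            # Adam: running averages of the first and second moments of the *raw* gradient;
            # beta1 controls the "momentum-like" first-moment average
            state["m"] = beta1 * state["m"] + (1 - beta1) * g             # first moment
            state["v"] = beta2 * state["v"] + (1 - beta2) * g ** 2        # second moment
            m_hat = state["m"] / (1 - beta1 ** t)                         # bias correction
            v_hat = state["v"] / (1 - beta2 ** t)
            return w - lr * m_hat / (np.sqrt(v_hat) + eps)

        # toy usage with a made-up gradient
        w = np.zeros(3)
        g = np.array([0.1, -0.2, 0.3])
        w_rms = rmsprop_momentum_step(w, g, {"v": np.zeros(3), "buf": np.zeros(3)})
        w_adam = adam_step(w, g, {"m": np.zeros(3), "v": np.zeros(3)}, t=1)

    The first moment m with its beta1 decay is what plays the momentum-like role in Adam, which is why frameworks such as CNTK label beta1 as "momentum", even though it is applied to the raw gradient rather than to the rescaled one.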