Tags: python-3.x, tensorflow, machine-learning, keras, adam

How can we access the effective learning rate of Adam in TensorFlow?


I am interested in the effective learning rate of Adam. We know that Adam's step is roughly an initial (constant) learning rate divided by the square root of a running average of the past squared gradients of the loss (see here for details). The point of the question is that an adaptive contribution acts on top of a constant initial learning rate.
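
For concreteness, one common way of writing the Adam update is sketched below. The symbols are the usual textbook ones and are an assumption of this sketch, not notation from the linked details: \eta is the constant learning rate, g_t the gradient at step t, and \epsilon a small stability constant. In this form, the per-parameter factor \alpha_t / (\sqrt{v_t} + \epsilon) that multiplies m_t is one natural reading of "effective learning rate".

```latex
\begin{aligned}
m_t      &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t      &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\alpha_t &= \eta \, \frac{\sqrt{1 - \beta_2^{\,t}}}{1 - \beta_1^{\,t}} \\
\theta_t &= \theta_{t-1} - \frac{\alpha_t}{\sqrt{v_t} + \epsilon}\, m_t
\end{aligned}
```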

Starting from the optimizer definition:

my_optimizer = tf.keras.optimizers.Adam(initial_learning_rate, beta_1 = my_beta_1, beta_2 = my_beta_2)

any of the following expressions prints the constant part of the Adam learning rate:

  • my_optimizer.learning_rate

  • my_optimizer.lr

  • keras.backend.get_value(my_optimizer.lr)

  • my_optimizer._decayed_lr(tf.float32)

Or we can modify the learning rate value through:

keras.backend.set_value(my_optimizer.lr, my_new_learning_rate) 

These expressions work well with fixed-learning-rate optimizers such as stochastic gradient descent.

There is this question in which zihaozhihao proposed calculating the value of the learning rate directly from the definition of Adam. I was looking for an easier way, like the expressions mentioned above, since, as the question title says, I want both to print and to modify the effective learning rate.

My question is: which TensorFlow function gives access to the value of Adam's effective learning rate?

I want to print it in order to monitor it, and to modify it in order to constrain its variations, since Adam can sometimes be unstable because it is adaptive.


Solution

  • It appears it's not possible in the current implementation.

    tf.keras.optimizers.Adam is implemented on top of the OptimizerV2 interface, with the main computation apparently happening in the _resource_apply_dense and _resource_apply_sparse functions. The first relies on C++ implementations such as ResourceApplyAdam::Compile, and the second is written in TensorFlow. Crucially, each function computes the effective learning rate and performs the gradient step in a single pass, so there is no point at which you could intercept or alter the value.

    For example, see the C++ implementation for dense variables:

    xla::XlaOp alpha = lr * xla::Sqrt(one - beta2_power) / (one - beta1_power);
    auto m_t = m + (grad - m) * (one - beta1);
    v = v + (xla::Square(grad) - v) * (one - beta2);
    if (use_nesterov_) {
      var = var - alpha * (m_t * beta1 + (one - beta1) * grad) /
                      (xla::Sqrt(v) + epsilon);
    } else {
      var = var - m_t * alpha / (xla::Sqrt(v) + epsilon);
    }
    

    There is no "compute effective learning rate" function call anywhere, and the value is not even stored in a variable.

    Therefore your best bet is to reimplement the optimizer (maybe by forking the original code) and add the feature you're interested in yourself.
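
As a rough illustration of what such a reimplementation could expose, here is a minimal sketch in plain Python (no TensorFlow, and not an OptimizerV2 subclass). It mirrors the dense-update formulas from the C++ excerpt above, without the Nesterov branch. The names `AdamSketch`, `effective_lr`, and `max_effective_lr` are hypothetical and exist only for this example, and the default `eps` is an assumed value.

```python
import math


class AdamSketch:
    """Toy Adam over a list of floats that exposes its effective learning rate.

    Hypothetical helper for illustration only; not part of TensorFlow.
    """

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7,
                 max_effective_lr=None):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps  # assumed default, chosen for this sketch
        self.max_effective_lr = max_effective_lr  # optional cap (hypothetical)
        self.t = 0      # number of steps taken so far
        self.m = None   # running average of gradients
        self.v = None   # running average of squared gradients

    def effective_lr(self):
        """Return the per-parameter factor that multiplies m_t in the update."""
        if self.t == 0:
            raise RuntimeError("call step() at least once first")
        # Bias-corrected step size, as in the C++ excerpt:
        # alpha = lr * sqrt(1 - beta2^t) / (1 - beta1^t)
        alpha = (self.lr * math.sqrt(1 - self.beta2 ** self.t)
                 / (1 - self.beta1 ** self.t))
        rates = [alpha / (math.sqrt(v) + self.eps) for v in self.v]
        if self.max_effective_lr is not None:
            # Constrain the adaptive part, which stock Adam does not allow.
            rates = [min(r, self.max_effective_lr) for r in rates]
        return rates

    def step(self, params, grads):
        """Apply one Adam update and return the new parameter values."""
        if self.m is None:
            self.m = [0.0] * len(params)
            self.v = [0.0] * len(params)
        self.t += 1
        self.m = [m + (g - m) * (1 - self.beta1)
                  for m, g in zip(self.m, grads)]
        self.v = [v + (g * g - v) * (1 - self.beta2)
                  for v, g in zip(self.v, grads)]
        rates = self.effective_lr()
        return [p - r * m for p, r, m in zip(params, rates, self.m)]


if __name__ == "__main__":
    opt = AdamSketch(lr=0.001, max_effective_lr=0.01)
    params = [1.0, -2.0]
    for _ in range(3):
        grads = [2 * p for p in params]  # gradient of sum(p^2)
        params = opt.step(params, grads)
        print(opt.effective_lr(), params)
```

Because the rate is computed in its own method before the update is applied, it can be logged or clipped at each step. Doing the same inside TensorFlow would mean writing the update in Python ops instead of calling the fused kernel.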