python · tensorflow · keras

Effective learning rate when using tf.distribute.MirroredStrategy (one host, multi-GPU)


When using tf.distribute.MirroredStrategy (one host, multi-GPU), is the effective learning rate the desired learning rate scaled by the number of GPUs (i.e. multiplied by the number of GPUs), or is it just the learning rate you would use on a single GPU?

For example, if I want a learning rate of 1E-3 when using 1 GPU, I just set learning rate = 1E-3 (without using tf.distribute.MirroredStrategy). If I use tf.distribute.MirroredStrategy with 8 GPUs, should I set learning rate = 8E-3 (8 * 1E-3), the same way I multiply the batch size by 8 when scaling to 8 GPUs, or should I just keep 1E-3 as the learning rate?

Thanks in advance!


Solution

  • The general advice is to use a larger learning rate when using a larger batch size, as each step processes more data.

    From the guide Distributed training with Keras:

    For larger datasets, the key benefit of distributed training is to learn more in each training step, because each step processes more training data in parallel, which allows for a larger learning rate (within the limits of the model and dataset).

    But as the quote mentions, this depends on the size of your dataset and your model. A small model (with few parameters) might not respond well to an overly high learning rate.

    See also this question for a more in-depth explanation of how to scale the learning rate with the batch size: How should the learning rate change as the batch size change? A minimal setup sketch follows below.
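
To make the scaling concrete, here is a minimal sketch of how batch size and learning rate can be scaled together under MirroredStrategy. The base batch size, base learning rate, model, and the linear scaling itself are illustrative assumptions, not a universal rule; whether the scaled rate actually trains well still depends on your model and dataset.

```python
import tensorflow as tf

# Hypothetical base values tuned for a single GPU.
BASE_BATCH_SIZE = 64
BASE_LEARNING_RATE = 1e-3

strategy = tf.distribute.MirroredStrategy()
num_replicas = strategy.num_replicas_in_sync  # e.g. 8 on an 8-GPU host

# Scale the global batch size with the number of replicas, and (optionally)
# apply the linear scaling rule to the learning rate.
global_batch_size = BASE_BATCH_SIZE * num_replicas
scaled_learning_rate = BASE_LEARNING_RATE * num_replicas

with strategy.scope():
    # Any Keras model works here; this small MLP is just a placeholder.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=scaled_learning_rate),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# Then train with the scaled global batch size, e.g.:
# model.fit(train_dataset.batch(global_batch_size), epochs=...)
```

In practice, a learning-rate warmup or a more conservative scaling is often used when the replica count is large, since the linearly scaled rate can be too aggressive early in training.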