Tags: pytorch, distributed-training

On batch size, epochs, and learning rate of DistributedDataParallel


I have read these threads [1] [2] [3] [4], and this article.
I think I understand how batch size and epochs work with DDP, but I am not sure about the learning rate.

Let's say I have a dataset of 100 * 8 images. In a non-distributed scenario, I set the batch size to 8, so each epoch will do 100 gradient steps.

Now I am in a multi-node, multi-GPU scenario with 2 nodes and 4 GPUs each (so the world size is 8).
I understand that each process should use a batch size of 8 / 8 = 1, because each update aggregates the gradients from all 8 GPUs. In each worker, the data loader will still load 100 batches, but each containing a single sample, so the whole dataset is traversed exactly once per epoch.
I checked, and everything seems to behave that way.
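
For reference, a minimal sketch of that setup (assuming a torchrun launch across the 2 nodes; the TensorDataset and the "nccl" backend are stand-ins, not part of my actual script):

    import torch
    import torch.distributed as dist
    from torch.utils.data import TensorDataset, DataLoader, DistributedSampler

    dist.init_process_group("nccl")              # torchrun provides the rendezvous env vars
    world_size = dist.get_world_size()           # 2 nodes * 4 GPUs = 8

    dataset = TensorDataset(torch.randn(800, 3, 32, 32))   # stands in for the 100 * 8 images

    global_batch_size = 8
    per_gpu_batch_size = global_batch_size // world_size   # 8 / 8 = 1

    sampler = DistributedSampler(dataset)        # each rank gets 800 / 8 = 100 samples
    loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)

    print(dist.get_rank(), len(loader))          # every rank: 100 batches of 1 sample each
    dist.destroy_process_group()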

But what about the learning rate? According to the official docs:

When a model is trained on M nodes with batch=N, the gradient will be M times smaller when compared to the same model trained on a single node with batch=M*N if the loss is summed (NOT averaged as usual) across instances in a batch (because the gradients between different nodes are averaged). [...] But in most cases, you can just treat a DistributedDataParallel wrapped model, a DataParallel wrapped model and an ordinary model on a single GPU as the same (E.g. using the same learning rate for equivalent batch size).

I understand that the gradients are averaged, so if the loss is averaged over samples nothing changes, while if it is summed we need to account for that. But does 'nodes' refer to the total number of GPUs across all cluster nodes (the world size) or just to the cluster nodes? In my example, would M be 2 or 8? Some posts in the threads I linked say that the gradient is divided 'by the number of GPUs'. How exactly is the gradient aggregated?
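
To make the averaging concrete, here is a toy single-process simulation of that doc statement (no DDP involved; each of the 8 "workers" is just one slice of the batch, and the linear model is made up for the sketch):

    import torch

    torch.manual_seed(0)
    w = torch.randn(4, requires_grad=True)
    x, y = torch.randn(8, 4), torch.randn(8)     # one "global" batch of 8 samples

    def grad_of(loss_fn, xs, ys):
        w.grad = None
        loss_fn(xs @ w, ys).backward()
        return w.grad.clone()

    mean_loss = lambda p, t: ((p - t) ** 2).mean()
    sum_loss = lambda p, t: ((p - t) ** 2).sum()

    # Mean-reduced loss: averaging the 8 per-worker gradients reproduces the
    # full-batch gradient exactly, so nothing changes.
    full = grad_of(mean_loss, x, y)
    averaged = torch.stack([grad_of(mean_loss, x[i:i+1], y[i:i+1]) for i in range(8)]).mean(0)
    print(torch.allclose(full, averaged))        # True

    # Summed loss: the averaged gradient is 8x smaller than the full-batch one.
    full = grad_of(sum_loss, x, y)
    averaged = torch.stack([grad_of(sum_loss, x[i:i+1], y[i:i+1]) for i in range(8)]).mean(0)
    print(torch.allclose(full, 8 * averaged))    # True

With the mean-reduced loss (the usual PyTorch default) the two match, while with a summed loss the averaged gradient is smaller by the number of workers, which seems to be what the quoted passage describes.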


Solution

  • Please refer to the following discussion:

    https://github.com/PyTorchLightning/pytorch-lightning/discussions/3706

    "As far as I know, learning rate is scaled with the batch size so that the sample variance of the gradients is kept approx. constant.

    Since DDP averages the gradients from all the devices, I think the LR should be scaled in proportion to the effective batch size, namely, batch_size * num_accumulated_batches * num_gpus * num_nodes

    In this case, assuming batch_size=512, num_accumulated_batches=1, num_gpus=2 and num_nodes=1, the effective batch size is 1024, thus the LR should be scaled by sqrt(2), compared to a single GPU with effective batch size 512."
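
  • A small sketch of that scaling rule (the sqrt-vs-linear choice is a heuristic from the linked discussion, not something PyTorch enforces, and the helper below is made up for illustration):

    import math

    def scaled_lr(base_lr, base_batch, per_gpu_batch, accum, world_size, rule="sqrt"):
        # base_lr was tuned for base_batch on a single GPU;
        # world_size is num_gpus_per_node * num_nodes.
        effective = per_gpu_batch * accum * world_size
        factor = effective / base_batch
        return base_lr * (math.sqrt(factor) if rule == "sqrt" else factor)

    # The quoted example: 512 per GPU, 2 GPUs, 1 node -> LR * sqrt(2).
    print(scaled_lr(1e-3, 512, 512, 1, 2))

    # The question's setup: a global batch of 8 split as 1 per GPU over 8 GPUs
    # keeps the effective batch size at 8, so the single-GPU LR is unchanged.
    print(scaled_lr(1e-3, 8, 1, 1, 8))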