My batch size is 512 and I have 8 GPUs.
Should I define rescale_grad = 1. / 512 or rescale_grad = 1. / (8*512)?
Thanks!
The batch size is the total for the whole machine, not a per-GPU value. Quote (from here):
Workload Partitioning
By default, MXNet partitions a data batch evenly among the available GPUs. Assume a batch size b and assume there are k GPUs, then in one iteration each GPU will perform forward and backward on b/k examples. The gradients are then summed over all GPUs before updating the model.
In your case b is 512. The gradients are summed over all 512 examples no matter how many GPUs share them, so you should use rescale_grad = 1. / 512
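To see why the number of GPUs drops out, here is a small toy simulation in plain Python (no MXNet involved). It assumes the behaviour described in the quote: the batch is split evenly and the per-GPU gradients are summed before the update. The per-example "gradients" are made-up numbers used only for illustration.

```python
# Toy simulation of the partitioning described in the quote.
# Nothing here calls MXNet; each "gradient" is just a number.
batch_size = 512   # b: total batch size across all GPUs
num_gpus = 8       # k: number of GPUs

# Hypothetical per-example gradients (placeholder values).
example_grads = [float(i % 7) for i in range(batch_size)]

# Split the batch evenly: each GPU gets b/k examples.
per_gpu = batch_size // num_gpus
shards = [example_grads[g * per_gpu:(g + 1) * per_gpu] for g in range(num_gpus)]

# Each GPU sums the gradients of its shard, then the sums are added together.
gpu_sums = [sum(shard) for shard in shards]
total_grad = sum(gpu_sums)

# Rescaling by 1/b turns the summed gradient into the batch mean.
rescale_grad = 1.0 / batch_size
mean_grad = total_grad * rescale_grad

# Reference value: the mean computed directly over the whole batch.
true_mean = sum(example_grads) / batch_size

# Rescaling by 1/(k*b) instead gives a gradient that is k times too small.
too_small = total_grad * (1.0 / (num_gpus * batch_size))

print(per_gpu, mean_grad == true_mean, mean_grad / too_small)
```

With 1. / 512 the rescaled gradient matches the mean over the batch. With 1. / (8*512) it comes out 8 times smaller, which acts like dividing the learning rate by 8.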