Tags: python, tensorflow, tensorflow2.0, distributed-computing, batchsize

TensorFlow's Mirrored strategy, batch size and Back Propagation


I'm dealing with training a neural network on a multi-GPU server. I'm using the MirroredStrategy API from TensorFlow 2.1 and I'm getting a little confused.

I have 8 GPUs (NVIDIA V100, 32 GB each).

  • I'm specifying a batch size of 32. How is it managed? Will each GPU get a batch of 32 samples, or should I specify 256 (32 × 8) as the batch size?
  • When and how is back-propagation applied? I've read that MirroredStrategy is synchronous: does that mean that after the forward step all per-GPU batches are grouped into one batch of size 32 × 8 before back-propagation is applied, or is back-propagation applied once for each batch of 32 in a sequential manner?

I really want to be sure about the experiments I submit to the server, since each training job is very time-consuming, and having the effective batch size (and the back-propagation behavior) change with the number of available GPUs would affect the correctness of my results.

Thank you for any help provided.


Solution

  • When using MirroredStrategy, the batch size you specify is the global batch size, as described in the tf.distribute docs:

    For instance, if using MirroredStrategy with 2 GPUs, each batch of size 10 will get divided among the 2 GPUs, with each receiving 5 input examples in each step.

    So in your case, if you want each GPU to process 32 samples per step, set the batch size to 32 * strategy.num_replicas_in_sync (see the first sketch below).

    Each GPU computes the forward and backward passes through the model on a different slice of the input data. The gradients computed on each slice are then aggregated across all of the devices and reduced (usually averaged) in a process known as AllReduce. The optimizer then performs a single parameter update with these reduced gradients, thereby keeping the replicas in sync. So back-propagation is not applied sequentially per GPU: there is exactly one synchronized update per global batch.
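    To make the batch-size handling concrete, here is a minimal sketch using Keras `model.fit`. The model architecture and the random placeholder data are purely illustrative; the point is that the dataset is batched with the *global* batch size, which MirroredStrategy then splits into one 32-sample slice per GPU:

    ```python
    import tensorflow as tf

    # MirroredStrategy picks up all visible GPUs by default.
    strategy = tf.distribute.MirroredStrategy()
    print("Replicas:", strategy.num_replicas_in_sync)  # 8 on your server

    PER_REPLICA_BATCH_SIZE = 32
    # The batch size passed to tf.data is the GLOBAL one; MirroredStrategy
    # splits each global batch into one slice per GPU (32 each here).
    GLOBAL_BATCH_SIZE = PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync

    # Variables must be created inside the strategy scope so they are mirrored.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )

    # Placeholder random data, just to make the sketch runnable.
    features = tf.random.normal([1024, 20])
    labels = tf.random.uniform([1024], maxval=10, dtype=tf.int64)
    dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(GLOBAL_BATCH_SIZE)

    model.fit(dataset, epochs=2)
    ```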
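    If you want to see where the AllReduce actually happens, a custom training loop makes the synchronous update explicit. Again the model and data below are placeholders; the key details are scaling the loss by the global batch size (so the summed gradients match single-GPU training) and calling `apply_gradients` inside the replica context, which is what triggers the AllReduce. Note that TF 2.1 spells the dispatch call `strategy.experimental_run_v2`; in TF 2.2+ it is `strategy.run`:

    ```python
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    GLOBAL_BATCH_SIZE = 32 * strategy.num_replicas_in_sync

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
        optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
        # Reduction.NONE: we average over the GLOBAL batch size ourselves,
        # so per-replica losses don't get over-weighted.
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

    def train_step(inputs):
        x, y = inputs
        with tf.GradientTape() as tape:
            logits = model(x, training=True)
            per_example_loss = loss_fn(y, logits)
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        # In a replica context, apply_gradients triggers the AllReduce:
        # gradients are summed across GPUs, then one synchronized update runs.
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    @tf.function
    def distributed_train_step(inputs):
        # Use strategy.experimental_run_v2 on TF 2.1.
        per_replica_losses = strategy.run(train_step, args=(inputs,))
        return strategy.reduce(
            tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

    # Placeholder data; batched with the global batch size, then distributed
    # so each replica sees its own 32-sample slice of every batch.
    features = tf.random.normal([1024, 20])
    labels = tf.random.uniform([1024], maxval=10, dtype=tf.int64)
    dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(GLOBAL_BATCH_SIZE)
    dist_dataset = strategy.experimental_distribute_dataset(dataset)

    for batch in dist_dataset:
        loss = distributed_train_step(batch)
    ```

    Either way, the answer to your second question is the same: the forward/backward passes run in parallel on the slices, and there is one synchronized parameter update per global batch, not eight sequential ones.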