Tags: python, tensorflow, tensorflow2.0, distributed-computing, batchsize

TensorFlow's Mirrored strategy, batch size and Back Propagation


I'm dealing with training a neural network on a multi-GPU server. I'm using the MirroredStrategy API from TensorFlow 2.1 and I'm getting a little confused.

I have 8 GPUs (NVIDIA V100, 32 GB each).

  • I'm specifying a batch size of 32. How is it managed? Will each GPU get a batch of 32 samples, or should I specify 256 (32 × 8) as the batch size?
  • When and how is back-propagation applied? I've read that MirroredStrategy is synchronous: does that mean that after the forward step all per-GPU batches are grouped into one batch of size 32 × 8 before back-propagation is applied, or is back-propagation applied once for each batch of 32 in a sequential manner?

I really want to be sure about the experiments I submit to the server, since each training job is very time-consuming, and having the effective batch size (and the back-propagation behavior) change with the number of available GPUs would affect the correctness of my results.

Thank you for any help provided.


Solution

  • When using MirroredStrategy, the batch size you specify is the global batch size, as described in the tf.distribute docs:

    For instance, if using MirroredStrategy with 2 GPUs, each batch of size 10 will get divided among the 2 GPUs, with each receiving 5 input examples in each step.

    So in your case, if you want each GPU to process 32 samples per step, set the batch size to 32 * strategy.num_replicas_in_sync (see the first sketch below).

    Each GPU computes the forward and backward passes through the model on a different slice of the input data. The gradients computed on each slice are then aggregated across all of the devices and reduced (usually averaged) in a process known as AllReduce. The optimizer then performs a single parameter update with these reduced gradients, thereby keeping the replicas in sync. So back-propagation is not applied sequentially per GPU: there is exactly one synchronized update per global batch.
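    To make the batch-size handling concrete, here is a minimal sketch using Keras `model.fit`. The model architecture and the random placeholder data are purely illustrative; the point is that the dataset is batched with the *global* batch size, which MirroredStrategy then splits into one 32-sample slice per GPU:

    ```python
    import tensorflow as tf

    # MirroredStrategy picks up all visible GPUs by default.
    strategy = tf.distribute.MirroredStrategy()
    print("Replicas:", strategy.num_replicas_in_sync)  # 8 on your server

    PER_REPLICA_BATCH_SIZE = 32
    # The batch size passed to tf.data is the GLOBAL one; MirroredStrategy
    # splits each global batch into one slice per GPU (32 each here).
    GLOBAL_BATCH_SIZE = PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync

    # Variables must be created inside the strategy scope so they are mirrored.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )

    # Placeholder random data, just to make the sketch runnable.
    features = tf.random.normal([1024, 20])
    labels = tf.random.uniform([1024], maxval=10, dtype=tf.int64)
    dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(GLOBAL_BATCH_SIZE)

    model.fit(dataset, epochs=2)
    ```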
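    If you want to see where the AllReduce actually happens, a custom training loop makes the synchronous update explicit. Again the model and data below are placeholders; the key details are scaling the loss by the global batch size (so the summed gradients match single-GPU training) and calling `apply_gradients` inside the replica context, which is what triggers the AllReduce. Note that TF 2.1 spells the dispatch call `strategy.experimental_run_v2`; in TF 2.2+ it is `strategy.run`:

    ```python
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    GLOBAL_BATCH_SIZE = 32 * strategy.num_replicas_in_sync

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
        optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
        # Reduction.NONE: we average over the GLOBAL batch size ourselves,
        # so per-replica losses don't get over-weighted.
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

    def train_step(inputs):
        x, y = inputs
        with tf.GradientTape() as tape:
            logits = model(x, training=True)
            per_example_loss = loss_fn(y, logits)
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        # In a replica context, apply_gradients triggers the AllReduce:
        # gradients are summed across GPUs, then one synchronized update runs.
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    @tf.function
    def distributed_train_step(inputs):
        # Use strategy.experimental_run_v2 on TF 2.1.
        per_replica_losses = strategy.run(train_step, args=(inputs,))
        return strategy.reduce(
            tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

    # Placeholder data; batched with the global batch size, then distributed
    # so each replica sees its own 32-sample slice of every batch.
    features = tf.random.normal([1024, 20])
    labels = tf.random.uniform([1024], maxval=10, dtype=tf.int64)
    dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(GLOBAL_BATCH_SIZE)
    dist_dataset = strategy.experimental_distribute_dataset(dataset)

    for batch in dist_dataset:
        loss = distributed_train_step(batch)
    ```

    Either way, the answer to your second question is the same: the forward/backward passes run in parallel on the slices, and there is one synchronized parameter update per global batch, not eight sequential ones.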