I'm dealing with training a neural network on a multi-GPU server. I'm using the MirroredStrategy API from TensorFlow 2.1 and I'm getting a little confused.
I have 8 GPUs (NVIDIA V100, 32 GB each).
I really want to be sure about the experiments I submit to the server, since each training job is very time-consuming, and having the batch size (and therefore the back-propagation) change based on the number of available GPUs affects the correctness of the results.
Thank you for any help provided.
When using MirroredStrategy, the batch size refers to the global batch size, as described in the docs here:
For instance, if using MirroredStrategy with 2 GPUs, each batch of size 10 will get divided among the 2 GPUs, with each receiving 5 input examples in each step.
So in your case, if you want each GPU to process 32 samples per step, you can set the batch size to 32 * strategy.num_replicas_in_sync.
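To make that concrete, here is a minimal sketch. The model, layer sizes, and dummy data are purely illustrative, not taken from your setup; the key lines are the global batch size computation and batching the dataset with it:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
print("Replicas in sync:", strategy.num_replicas_in_sync)  # 8 on your server

PER_REPLICA_BATCH_SIZE = 32  # samples each GPU should see per step
GLOBAL_BATCH_SIZE = PER_REPLICA_BATCH_SIZE * strategy.num_replicas_in_sync  # 256 with 8 GPUs

# Dummy data just for illustration.
x = tf.random.normal((1024, 20))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)

# Batch with the *global* batch size; MirroredStrategy splits each batch
# evenly across the replicas (32 samples per V100 here).
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(GLOBAL_BATCH_SIZE)

# Model variables must be created under the strategy scope so they are
# mirrored on every device.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

model.fit(dataset, epochs=2)
```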
Each GPU will compute the forward and backward passes through the model on a different slice of the input data. The gradients computed on each of these slices are then aggregated across all of the devices and reduced (usually averaged) in a process known as AllReduce. The optimizer then performs the parameter updates with these reduced gradients, thereby keeping the devices in sync.
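If you want to see where that reduction happens explicitly, here is a rough custom-training-loop sketch reusing the strategy, model, dataset, and GLOBAL_BATCH_SIZE names from the snippet above (again illustrative, not your code). Note that in TF 2.1 the per-replica call is strategy.experimental_run_v2; it was renamed to strategy.run in TF 2.2:

```python
with strategy.scope():
    optimizer = tf.keras.optimizers.Adam()
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True,
        reduction=tf.keras.losses.Reduction.NONE,  # reduce manually below
    )

def compute_loss(labels, logits):
    per_example_loss = loss_object(labels, logits)
    # Divide by the *global* batch size so that, after the gradients are
    # summed across replicas, the result matches a single-device run with
    # the same global batch.
    return tf.nn.compute_average_loss(
        per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)

def train_step(inputs):
    x_batch, y_batch = inputs
    with tf.GradientTape() as tape:
        logits = model(x_batch, training=True)
        loss = compute_loss(y_batch, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    # apply_gradients is where MirroredStrategy AllReduces the gradients
    # across devices before updating the mirrored variables.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(inputs):
    per_replica_losses = strategy.experimental_run_v2(train_step, args=(inputs,))
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

dist_dataset = strategy.experimental_distribute_dataset(dataset)
for batch in dist_dataset:
    distributed_train_step(batch)
```

This is what model.fit does for you under the hood, so you only need it if you want fine-grained control over the loss reduction.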