I am running Distributed TensorFlow with the CIFAR10 example, with up to 128 workers and 1 parameter server.
Does FLAGS.batch_size determine the size of the batch processed by EACH worker, or the size of a single batch split across ALL workers?
The distinction matters for performance: splitting one batch across too many workers can lead to too much communication and not enough computation per worker.
The batch size in the distributed CIFAR10 example refers to the per-GPU batch size, so each worker processes FLAGS.batch_size examples per step.
(But it's a good question to ask - some of the synchronous models refer to it as the aggregate batch size instead!)
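To make the arithmetic concrete, here is a minimal sketch (not code from the CIFAR10 example itself; the worker count and batch size below are hypothetical) showing how the aggregate number of examples consumed per step scales when FLAGS.batch_size is interpreted per worker:

```python
# Hypothetical values for illustration only.
num_workers = 128        # workers in the cluster
per_worker_batch = 128   # what FLAGS.batch_size means in the CIFAR10 example

# Each worker dequeues per_worker_batch examples every step, so across
# the whole cluster the number of examples consumed per step is:
aggregate_batch = per_worker_batch * num_workers
print(aggregate_batch)   # 16384
```

In other words, holding FLAGS.batch_size fixed and adding workers grows the aggregate batch proportionally; it does not split one batch more thinly across the workers.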