When running the TensorFlow benchmarks from the terminal, there are several parameters we can specify. One of them is gradient_repacking. What does it represent, and how should one think about setting it?
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 \
--model=resnet50 --optimizer=momentum --variable_update=replicated \
--nodistortions --gradient_repacking=8 --num_gpus=8 \
--num_epochs=90 --weight_decay=1e-4 --data_dir=${DATA_DIR} --use_fp16 \
--train_dir=${CKPT_DIR}
For those searching in the future, gradient_repacking affects all-reduce in replicated mode. From the flags definition:
flags.DEFINE_integer('gradient_repacking', 0, 'Use gradient repacking. It'
'currently only works with replicated mode. At the end of'
'of each step, it repacks the gradients for more efficient'
'cross-device transportation. A non-zero value specifies'
'the number of split packs that will be formed.',
lower_bound=0)
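To make "the number of split packs" in the flag description above more concrete, here is a minimal pure-Python sketch of one way to read it: flatten all per-variable gradients into one buffer, split that buffer into N near-equal packs, all-reduce each pack, then restore the original shapes. This is an illustration under assumptions, not the tf_cnn_benchmarks implementation. The names repack, unpack and all_reduce_sum are hypothetical, and the "all-reduce" is just an element-wise sum over toy devices.

```python
from typing import List, Sequence, Tuple


def repack(grads: Sequence[Sequence[float]],
           num_packs: int) -> Tuple[List[List[float]], List[int]]:
    """Flatten per-variable gradients and split them into num_packs near-equal packs."""
    if num_packs < 1:
        raise ValueError("num_packs must be >= 1")
    sizes = [len(g) for g in grads]        # remembered so unpack() can restore the layout
    flat = [x for g in grads for x in g]   # one contiguous buffer
    base, extra = divmod(len(flat), num_packs)
    packs, start = [], 0
    for i in range(num_packs):
        end = start + base + (1 if i < extra else 0)  # first `extra` packs get one more element
        packs.append(flat[start:end])
        start = end
    return packs, sizes


def unpack(packs: Sequence[Sequence[float]], sizes: Sequence[int]) -> List[List[float]]:
    """Inverse of repack(): rebuild the per-variable gradient lists."""
    flat = [x for p in packs for x in p]
    out, start = [], 0
    for n in sizes:
        out.append(flat[start:start + n])
        start += n
    return out


def all_reduce_sum(per_device_packs: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Toy all-reduce (assumption): element-wise sum of pack k across all devices."""
    num_packs = len(per_device_packs[0])
    return [
        [sum(vals) for vals in zip(*(dev[k] for dev in per_device_packs))]
        for k in range(num_packs)
    ]


# Two toy "GPUs", each holding gradients for three variables of sizes 3, 1 and 4.
gpu0 = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0, 7.0, 8.0]]
gpu1 = [[10.0, 20.0, 30.0], [40.0], [50.0, 60.0, 70.0, 80.0]]

packs0, sizes = repack(gpu0, num_packs=2)   # 8 values -> 2 packs of 4
packs1, _ = repack(gpu1, num_packs=2)
summed = unpack(all_reduce_sum([packs0, packs1]), sizes)
print(summed)
```

In this reading, three uneven tensors become two equal-sized transfers, which is the sort of "more efficient cross-device transportation" the flag text hints at. Whether a larger or smaller pack count is faster would depend on the hardware and model, so it is something to measure rather than assume.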
As for the optimal value, I've seen both gradient_repacking=8 (as you have) and gradient_repacking=2 used.
My best guess is that the parameter refers to the number of shards the gradients are broken into for sharing among the other workers. Eight in this case (for your num_gpus=8) would seem to mean each GPU shares with every other GPU (i.e. all-to-all), while 2 would mean sharing only with neighbors in a ring fashion.
Given that Horovod uses its own all-reduce algorithm, it makes sense that setting gradient_repacking has no effect when --variable_update=horovod.