I am building a classifier with Hugging Face and would like to understand the line Total train batch size (w. parallel, distributed & accumulation) = 64 in the training log below:
Num examples = 7000
Num Epochs = 3
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 16
Total optimization steps = 327
I have 7000 rows of data, I have set the number of epochs to 3, per_device_train_batch_size = 4, and per_device_eval_batch_size = 16. I also follow how Total optimization steps = 327 comes about (roughly 7000 * 3 / 64). But I am not clear about Total train batch size (w. parallel, distributed & accumulation) = 64. Does it mean that there are 16 devices, since 16 * 4 (Instantaneous batch size per device = 4) comes to 64?
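For context, the arguments that produce this log are set roughly like this (the output directory and the rest of the Trainer setup are placeholders, not my exact script):

```python
from transformers import TrainingArguments

# Values matching the log above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=16,
)
```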
The variable used to print that summary is this one: https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py#L1211.
The total train batch size is defined as train_batch_size * gradient_accumulation_steps * world_size, so in your case 4 * 16 * 1 = 64. It does not mean you have 16 devices; the factor of 16 comes from gradient accumulation. world_size is always 1 except when you are using a TPU or training in parallel/distributed mode, see https://github.com/huggingface/transformers/blob/master/src/transformers/training_args.py#L1127.
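As a rough sketch of how the numbers in your log relate (assuming a single device, i.e. world_size = 1; exact rounding details can vary slightly between versions):

```python
import math

# Values from the question / training log.
num_examples = 7000
num_epochs = 3
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
world_size = 1  # 1 unless training on TPU or with multiple processes

# Total train batch size = per-device batch * accumulation steps * number of devices.
total_train_batch_size = (per_device_train_batch_size
                          * gradient_accumulation_steps
                          * world_size)
print(total_train_batch_size)  # 64

# Optimization steps: the dataloader yields ceil(7000 / 4) = 1750 batches per epoch,
# and the optimizer steps once every 16 batches -> 1750 // 16 = 109 steps per epoch.
batches_per_epoch = math.ceil(num_examples / (per_device_train_batch_size * world_size))
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_optimization_steps = steps_per_epoch * num_epochs
print(total_optimization_steps)  # 327
```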