I am building a classifier with Hugging Face and would like to understand the line Total train batch size (w. parallel, distributed & accumulation) = 64 in the training log below:
Num examples = 7000
Num Epochs = 3
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 16
Total optimization steps = 327
I have 7000 rows of data, I have set the number of epochs to 3, per_device_train_batch_size = 4, and per_device_eval_batch_size = 16. I also follow how Total optimization steps = 327 comes about (roughly 7000 * 3 / 64). But I am not clear about Total train batch size (w. parallel, distributed & accumulation) = 64. Does it mean that there are 16 devices, since 16 * 4 (Instantaneous batch size per device = 4) comes to 64?
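For context, the arguments that produce this log are set roughly like this (the output directory and the rest of the Trainer setup are placeholders, not my exact script):

```python
from transformers import TrainingArguments

# Values matching the log above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=16,
)
```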
The variable used to print that summary is this one: https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py#L1211.
The total train batch size is defined as train_batch_size * gradient_accumulation_steps * world_size, so in your case 4 * 16 * 1 = 64. It does not mean you have 16 devices; the factor of 16 comes from gradient accumulation. world_size is always 1 except when you are using a TPU or training in parallel/distributed mode, see https://github.com/huggingface/transformers/blob/master/src/transformers/training_args.py#L1127.
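As a rough sketch of how the numbers in your log relate (assuming a single device, i.e. world_size = 1; exact rounding details can vary slightly between versions):

```python
import math

# Values from the question / training log.
num_examples = 7000
num_epochs = 3
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
world_size = 1  # 1 unless training on TPU or with multiple processes

# Total train batch size = per-device batch * accumulation steps * number of devices.
total_train_batch_size = (per_device_train_batch_size
                          * gradient_accumulation_steps
                          * world_size)
print(total_train_batch_size)  # 64

# Optimization steps: the dataloader yields ceil(7000 / 4) = 1750 batches per epoch,
# and the optimizer steps once every 16 batches -> 1750 // 16 = 109 steps per epoch.
batches_per_epoch = math.ceil(num_examples / (per_device_train_batch_size * world_size))
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_optimization_steps = steps_per_epoch * num_epochs
print(total_optimization_steps)  # 327
```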