Tags: python, pytorch, distributed-training

How to know how many GPUs are used in PyTorch?


The bash file I used to launch the training looks like this:

CUDA_VISIBLE_DEVICES=3,4 python -m torch.distributed.launch \
--nproc_per_node=2  train.py \
--batch_size 6 \
--other_args

I found that the batch size of the tensors on each GPU is actually batch_size / num_of_gpus = 6 / 2 = 3.

When I initialize my network, I need to know the batch size on each GPU. (P.S. At this stage I can't use input_tensor.shape to get the size of the batch dimension, since no data has been fed in yet.)

Somehow I could not find where PyTorch stores the --nproc_per_node parameter. So how can I find out how many GPUs are being used, without passing it manually as one of --other_args?


Solution

  • I think you are looking for torch.distributed.get_world_size() - this will tell you how many processes were created. Note that the default process group has to be initialized with torch.distributed.init_process_group() before you can call it; see the sketch below.
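
A minimal sketch of how train.py could pick this up, assuming it is launched with torch.distributed.launch exactly as in the question (the --local_rank argument and the WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables are filled in by the launcher; --batch_size mirrors the launch script, the rest is illustrative):

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
parser.add_argument("--batch_size", type=int, default=6)  # global batch size from the launch script
args = parser.parse_args()

# Initialize the default process group; NCCL is the usual backend for multi-GPU training.
# The default init_method "env://" picks up the variables set by the launcher.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(args.local_rank)

# Number of processes started via --nproc_per_node (2 in the example launch).
world_size = dist.get_world_size()

# Per-GPU batch size, known before any data is fed through the network.
batch_size_per_gpu = args.batch_size // world_size
print(f"world_size={world_size}, per-GPU batch size={batch_size_per_gpu}")

If you need the number of processes before the process group is initialized, the launcher also exports it as an environment variable, so int(os.environ["WORLD_SIZE"]) should work as well.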