Tags: pytorch, gpu, distributed-computing

Get local world size in torch distributed training


Suppose I have 2 machines with 4 GPUs each, and that each instance of the training algorithm requires 2 GPUs. I would like to run 4 processes, 2 on each machine, with each process using 2 GPUs.

How can I make each process retrieve the number of local processes running on the same machine? I can detect the world size with

torch.distributed.get_world_size()

and the global rank with

torch.distributed.get_rank()

But, since I would rather not hard-code parameters, is there a way to discover that 2 processes are running on each node? This would be useful for assigning GPUs to each process evenly.

Example: Suppose I know that a machine has 4 GPUs and that there are 2 processes on it; I would assign GPUs [0, 1] to the process with local rank 0 and GPUs [2, 3] to the process with local rank 1. I know the total number of processes, but I cannot tell which of them are on the same machine, so I cannot decide how many GPUs each one is allowed to use.
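
To make the intended mapping concrete, here is a minimal sketch of the assignment I have in mind, assuming I could somehow obtain local_rank and local_world_size (both are placeholder variables here, not existing torch.distributed calls):

    import torch

    # Placeholder inputs: how each process would learn these is exactly the question.
    local_rank = 0        # index of this process on its own machine (0 or 1 here)
    local_world_size = 2  # number of processes running on this machine

    ngpus = torch.cuda.device_count()          # GPUs physically present on this machine, e.g. 4
    gpus_per_proc = ngpus // local_world_size  # 4 GPUs / 2 processes = 2 GPUs each

    # Local rank 0 gets GPUs [0, 1]; local rank 1 gets GPUs [2, 3].
    my_gpus = list(range(local_rank * gpus_per_proc,
                         (local_rank + 1) * gpus_per_proc))
    print(f"local rank {local_rank} -> GPUs {my_gpus}")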

What I need is something like a torch.distributed.get_local_world_size() function.


Solution

  • torch.cuda.device_count() is essentially the local world size here and is useful for determining how many GPUs are available on each machine. If you can't do that for some reason, plain MPI might help:

    from mpi4py import MPI
    import torch

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()  # rank of this process in MPI.COMM_WORLD

    ngpus = torch.cuda.device_count()  # number of GPUs visible on this machine
    print(ngpus, "gpus on this machine, seen by process", rank)  # local GPU count for each process
    
    

    but I think it would work to just call torch.cuda.device_count() in any case, without adding this dependency. I am pretty new here, so if you can, please let me know how this answer can be improved.
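
    For the dependency-free route, here is a minimal sketch of how torch.distributed.get_rank() and torch.cuda.device_count() could be combined for the setup in the question. It assumes the process group is already initialized, that every node runs the same number of processes, and that global ranks are assigned contiguously per node (common with the usual launchers, but an assumption, not something torch.distributed guarantees); the 2-GPUs-per-process figure is taken from the question.

    import torch
    import torch.distributed as dist

    # Assumes dist.init_process_group(...) has already been called by the launcher.
    gpus_per_proc = 2                        # from the question: each training instance needs 2 GPUs
    ngpus = torch.cuda.device_count()        # GPUs on this machine, e.g. 4
    procs_per_node = ngpus // gpus_per_proc  # the "local world size" being asked for, e.g. 2

    rank = dist.get_rank()                   # global rank, e.g. 0..3 across both machines
    local_rank = rank % procs_per_node       # assumes contiguous rank assignment per node

    my_gpus = list(range(local_rank * gpus_per_proc,
                         (local_rank + 1) * gpus_per_proc))
    print(f"rank {rank}: local rank {local_rank}, GPUs {my_gpus}")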