There are 3 GPUs in my system.
I want to run on the last one, i.e. GPU 2. For this reason, I set gpu_id to 2 in my configuration file and also set CUDA_VISIBLE_DEVICES=2. But in my program, the following line always assigns the 0th GPU:
local_rank = torch.distributed.get_rank()
How can I fix this issue?
By setting CUDA_VISIBLE_DEVICES=2, you tell the system to expose only the third GPU to your process. That is, as far as PyTorch is concerned, there is only one GPU, and inside the process it is indexed as device 0. With a single process, torch.distributed.get_world_size() returns 1 (and not 3), and torch.distributed.get_rank() returns 0, since there are no other GPUs available to the process. Physically, however, all processing is done on the third GPU that was allocated to the job.
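
To make the renumbering concrete, here is a minimal sketch of how a process-local device index maps back to the GPU listed in CUDA_VISIBLE_DEVICES. The helper name physical_gpu_index is hypothetical (it is not part of PyTorch or CUDA), and the sketch assumes the variable holds plain integer indices rather than GPU UUIDs:

```python
import os


def physical_gpu_index(local_index, env=None):
    """Map a process-local CUDA device index to the index named in
    CUDA_VISIBLE_DEVICES (hypothetical helper, for illustration only)."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No masking: local and system-wide indices coincide.
        return local_index
    # Assumption: entries are integer indices, e.g. "2" or "0,2".
    ids = [int(tok) for tok in visible.split(",") if tok.strip()]
    # Visible devices are renumbered 0, 1, ... in the order listed.
    return ids[local_index]


# With CUDA_VISIBLE_DEVICES=2, local device 0 is the third GPU.
print(physical_gpu_index(0, {"CUDA_VISIBLE_DEVICES": "2"}))  # 2
```

So in the setup from the question, device 0 inside the program already is GPU 2, and nothing needs to be fixed.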