pytorch, cuda, gpu, distributed-computing

Why is `local_rank` zero in DDP even though I set the visible CUDA device to 2?


There are 3 GPUs in my system.

I want to run on the last one, i.e. GPU 2. For this reason, I set gpu_id to 2 in my configuration file and also set CUDA_VISIBLE_DEVICES=2. But in my program, the following lines always select the 0th GPU.

        local_rank = torch.distributed.get_rank()
        torch.cuda.set_device(local_rank) 

How can I fix this issue?


Solution

  • When you set CUDA_VISIBLE_DEVICES=2, you tell the OS to expose only the third GPU to your process. As far as PyTorch is concerned, there is only one GPU, so torch.distributed.get_world_size() returns 1 (not 3). The rank of that GPU within your process is 0, since no other GPUs are visible to the process. But as far as the OS is concerned, all processing is done on the third GPU, which was allocated to the job.
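
    As a minimal sketch of this behavior (assuming a single-process launch, e.g. `CUDA_VISIBLE_DEVICES=2 torchrun --nproc_per_node=1 check_rank.py`; the script name and launch command are illustrative):

        import torch
        import torch.distributed as dist

        # With CUDA_VISIBLE_DEVICES=2, only the physical GPU 2 is visible,
        # and it is renumbered as device 0 inside this process.
        dist.init_process_group(backend="nccl")

        local_rank = dist.get_rank()
        print(dist.get_world_size())        # 1 -- only one process / one visible GPU
        print(local_rank)                   # 0 -- the single visible GPU is indexed 0
        print(torch.cuda.device_count())    # 1 -- the other two GPUs are hidden

        torch.cuda.set_device(local_rank)   # device 0 here *is* the physical GPU 2
        print(torch.cuda.current_device())  # 0

        dist.destroy_process_group()

    In other words, local_rank being 0 is expected: device 0 in the remapped numbering is the physical GPU 2, so the existing torch.cuda.set_device(local_rank) call already runs your job on the GPU you intended.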