PyTorch not accepting GPU devices

I am working with the DeepLab V3 model from the repo linked below.

link: https://github.com/jfzhang95/pytorch-deeplab-xception

When I try to run on more than one GPU, I get the following error:

Traceback (most recent call last):
  File "train.py", line 313, in <module>
    main()
  File "train.py", line 302, in main
    trainer = Trainer(args)
  File "train.py", line 76, in __init__
    self.model = torch.nn.DataParallel(self.model, device_ids=[1,2,3]).to(args.cuda)
  File "/home/deshpand/anaconda3/envs/torch_env/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 145, in __init__
    _check_balance(self.device_ids)
  File "/home/deshpand/anaconda3/envs/torch_env/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 25, in _check_balance
    dev_props = _get_devices_properties(device_ids)
  File "/home/deshpand/anaconda3/envs/torch_env/lib/python3.8/site-packages/torch/_utils.py", line 577, in _get_devices_properties
    return [_get_device_attr(lambda m: m.get_device_properties(i)) for i in device_ids]
  File "/home/deshpand/anaconda3/envs/torch_env/lib/python3.8/site-packages/torch/_utils.py", line 577, in <listcomp>
    return [_get_device_attr(lambda m: m.get_device_properties(i)) for i in device_ids]
  File "/home/deshpand/anaconda3/envs/torch_env/lib/python3.8/site-packages/torch/_utils.py", line 558, in _get_device_attr
    return get_member(torch.cuda)
  File "/home/deshpand/anaconda3/envs/torch_env/lib/python3.8/site-packages/torch/_utils.py", line 577, in <lambda>
    return [_get_device_attr(lambda m: m.get_device_properties(i)) for i in device_ids]
  File "/home/deshpand/anaconda3/envs/torch_env/lib/python3.8/site-packages/torch/cuda/__init__.py", line 374, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id

The line that raises this error is shown below.

self.model = torch.nn.DataParallel(self.model, device_ids=[1,2,3])

I set os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3" in the program, but I still get this error.

Can someone please help me with this?

Solution

At least on my system (Ubuntu 20.04 running torch 1.12.1+cu102), torch.cuda numbers devices from zero based on the devices visible when it initializes, not on the system device IDs. So when you hide the first GPU with os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3" and then import torch, it sees the second, third and fourth GPUs as cuda:0, cuda:1 and cuda:2. Passing device_ids=[1,2,3] then fails with "Invalid device id" because cuda:3 does not exist. Using device_ids=[0,1,2] (or omitting device_ids, so that DataParallel uses all visible devices) should avoid the error.
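
To make the renumbering concrete, here is a minimal sketch of how the physical IDs listed in CUDA_VISIBLE_DEVICES line up with the logical indices that torch.cuda uses. The helper logical_device_ids is a hypothetical name of mine, not part of PyTorch, and it assumes a plain comma-separated list of valid device IDs. The DataParallel call is left as a comment so the snippet runs without torch or GPUs; `model` there stands for your own network.

```python
import os

# Set this before CUDA is initialized (safest: before importing torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"


def logical_device_ids(visible=None):
    """Hypothetical helper: map each physical GPU id in CUDA_VISIBLE_DEVICES
    to the zero-based logical index that torch.cuda would assign to it.

    Assumes a simple comma-separated list of valid ids.
    """
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    physical = [p.strip() for p in visible.split(",") if p.strip()]
    return {phys: idx for idx, phys in enumerate(physical)}


mapping = logical_device_ids()
device_ids = list(mapping.values())

print(mapping)      # {'1': 0, '2': 1, '3': 2}
print(device_ids)   # [0, 1, 2]

# With torch installed and the GPUs present, the model would be wrapped
# with the logical ids rather than the physical ones, for example:
#   import torch
#   model = torch.nn.DataParallel(model, device_ids=device_ids).to("cuda:0")
```

In practice you rarely need such a helper: list(range(torch.cuda.device_count())) gives the same logical IDs once torch has been imported.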