When running a distributed PyTorch Lightning training job in multiple Docker containers (e.g., via Slurm), NCCL fails to initialize inter-process communication between containers running on the same host, but works fine when the containers run on different hosts. Why does this happen and how can it be fixed?
Command for each PyTorch Lightning instance:
$ docker run ...
Logs:
...
0: aws-p4d-02:1:14 [0] transport/p2p.cc:136 NCCL WARN Cuda failure 'invalid device context'
0: aws-p4d-02:1:14 [0] NCCL INFO transport/p2p.cc:238 -> 1
0: aws-p4d-02:1:14 [0] NCCL INFO transport.cc:111 -> 1
0: aws-p4d-02:1:14 [0] NCCL INFO init.cc:778 -> 1
0: aws-p4d-02:1:14 [0] NCCL INFO init.cc:904 -> 1
...
0: Traceback (most recent call last):
0: File "/.../script.py", line 81, in <module>
0: main()
0: File "/.../script.py", line 70, in main
0: td.all_reduce(a) # <--- ncclUnhandledCudaError: Call to CUDA function failed.
0: File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1320, in all_reduce
0: work = default_pg.allreduce([tensor], opts)
0: RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled cuda error, NCCL version 2.10.3
0: ncclUnhandledCudaError: Call to CUDA function failed.
...
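For reference, here is a minimal standalone sketch of what script.py does around the failing call (not the actual training code; the rendezvous variables RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and LOCAL_RANK are assumed to be set by the launcher):

import os

import torch
import torch.distributed as td


def main():
    # NCCL backend; rendezvous via the env:// defaults provided by the launcher.
    td.init_process_group(backend="nccl")

    # One GPU per rank; LOCAL_RANK is assumed to come from the launcher.
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

    a = torch.ones(1, device="cuda")
    td.all_reduce(a)  # <--- fails with ncclUnhandledCudaError when the two
                      #      containers share a host but not a PID namespace
    print(f"rank {td.get_rank()}: sum = {a.item()}")

    td.destroy_process_group()


if __name__ == "__main__":
    main()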
Disabling the PID (process ID) namespace on the Docker containers fixes it (docker run --pid=host ...). The stack trace below points to a line in the NCCL source that runs just after a branch comparing the two ranks' PIDs. A PID namespace maps PIDs outside the namespace to different PIDs inside it, so the same PID can refer to two different processes in two different containers on the same host. When that collision happens, NCCL's peer-to-peer setup evidently treats two distinct container processes as one and the same, and the subsequent CUDA call fails with 'invalid device context'. Containers on different hosts are unaffected because NCCL does not take the intra-node peer-to-peer path across hosts, so the PID comparison never matters there. With --pid=host every container sees the host's PID namespace, PIDs are unique again on the host, and the check behaves as intended. A small diagnostic sketch that surfaces this PID collision follows the trace below.
0: aws-p4d-02:1:14 [0] transport/p2p.cc:136 NCCL WARN Cuda failure 'invalid device context'
0: aws-p4d-02:1:14 [0] NCCL INFO transport/p2p.cc:238 -> 1
0: aws-p4d-02:1:14 [0] NCCL INFO transport.cc:111 -> 1
0: aws-p4d-02:1:14 [0] NCCL INFO init.cc:778 -> 1
0: aws-p4d-02:1:14 [0] NCCL INFO init.cc:904 -> 1
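As a quick way to confirm the collision without reading NCCL's source, a rough diagnostic along these lines can be run inside the job. The function name and messages are mine, it assumes the default process group is already initialized, and it assumes the containers report the host's hostname (as the aws-p4d-02 prefixes in the logs suggest). It gathers (hostname, pid) pairs over a Gloo group, so the check itself does not depend on NCCL working:

import os
import socket

import torch.distributed as td


def check_pid_collisions():
    # Build a CPU-only Gloo group for the diagnostic so it works even when
    # NCCL initialization is broken.
    gloo = td.new_group(backend="gloo")

    ident = (socket.gethostname(), os.getpid())
    world = [None] * td.get_world_size()
    td.all_gather_object(world, ident, group=gloo)

    if td.get_rank() == 0:
        seen = {}
        for rank, (host, pid) in enumerate(world):
            if (host, pid) in seen:
                print(f"WARNING: ranks {seen[(host, pid)]} and {rank} both report "
                      f"pid {pid} on host {host}; they are likely in separate PID "
                      f"namespaces on the same machine (try docker run --pid=host).")
            else:
                seen[(host, pid)] = rank

Every rank must call check_pid_collisions() (it performs collectives); only rank 0 prints the warnings.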