Tags: python, pytorch, distributed, multi-gpu

How to fix 'RuntimeError: Address already in use' in PyTorch?


I am trying to run a distributed application with the PyTorch distributed trainer. I thought I would first try the example they provide, found here. I set up two AWS EC2 instances and configured them according to the description in the link, but when I try to run the code I get two different errors. In the first terminal window, for node0, I get the error message: RuntimeError: Address already in use

Under the other three windows I get the same error message:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:272, unhandled system error

I followed the code in the link, then terminated the instances and redid the setup, but it didn't help.

This is using Python 3.6 with the nightly build and CUDA 9.0. I tried setting MASTER_ADDR to the IP of node0 on both nodes, as well as using the same MASTER_PORT (which is an available, unused port). However, I still get the same error messages.
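For reference, this is a minimal sketch of how the environment-variable rendezvous used in that tutorial is typically wired up. The IP, port, world size, and the assumption that RANK is set per process are placeholders for illustration, not the actual configuration from this question:

    import os
    import torch.distributed as dist

    # Placeholder values; in practice use node0's private IP and a free port.
    os.environ.setdefault("MASTER_ADDR", "172.31.0.10")
    os.environ.setdefault("MASTER_PORT", "29500")

    dist.init_process_group(
        backend="nccl",        # NCCL backend for multi-GPU, multi-node training
        init_method="env://",  # reads MASTER_ADDR / MASTER_PORT from the environment
        world_size=4,          # assumed total number of processes
        rank=int(os.environ["RANK"]),  # assumes RANK is exported for each process
    )

All processes must agree on MASTER_ADDR and MASTER_PORT, and the port must be reachable from every node, which is where the security-group issue described in the accepted solution comes in.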

After getting this running, my goal is to adjust this StyleGAN implementation so that I can train it across multiple GPUs on two different nodes.


Solution

  • So after a lot of failed attempts I found out what the problem is. Note that this solution applies to AWS Deep Learning instances.

    After creating the two instances I had to adjust the security group by adding two rules: the first rule should be ALL TCP with the source set to the private IP of the leader node; the second rule should be the same (ALL TCP), but with the source set to the private IP of the slave node (see the sketch below for the same rules expressed in code).

    Previously, I had the security rule set to Type: SSH, which only opens a single port (22). For some reason I was not able to use this port to let the nodes communicate. After changing these settings the code worked fine. I was also able to run it with the MASTER_ADDR and MASTER_PORT settings mentioned above.
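    If you prefer to script the change instead of using the EC2 console, here is a rough sketch of adding the two ALL TCP rules with boto3. The security group ID, region, and private IPs are hypothetical placeholders, not values from the original setup:

        import boto3

        ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

        # Hypothetical values; substitute your own security group ID and private IPs.
        SECURITY_GROUP_ID = "sg-0123456789abcdef0"
        LEADER_PRIVATE_IP = "172.31.0.10"
        SLAVE_PRIVATE_IP = "172.31.0.11"

        for ip in (LEADER_PRIVATE_IP, SLAVE_PRIVATE_IP):
            ec2.authorize_security_group_ingress(
                GroupId=SECURITY_GROUP_ID,
                IpPermissions=[{
                    "IpProtocol": "tcp",
                    "FromPort": 0,
                    "ToPort": 65535,  # ALL TCP
                    "IpRanges": [{"CidrIp": f"{ip}/32"}],
                }],
            )

    Opening ALL TCP only to the two nodes' private IPs keeps the group reasonably restrictive while still letting the rendezvous port and NCCL's dynamically chosen ports through.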