python distributed-computing ray aws-batch ray-train

Running Ray on top of AWS Batch multi-node?


I am interested in running Ray on AWS Batch multi-node. As far as I can tell, this pattern hasn't been tried with Ray before, so there's no documentation on it. Still, I'd really like to try it, since Ray can also be installed on-premises and so isn't tied to a particular cluster manager.

I stood up the AWS Batch multi-node gang-scheduled cluster and ran the following commands:

  1. For the head node:
import subprocess
subprocess.Popen(f"ray start --head --node-ip-address {current.parallel.main_ip} --port {master_port} --block", shell=True).wait()
  2. For the worker nodes:
import subprocess
import ray
node_ip_address = ray._private.services.get_node_ip_address()
subprocess.Popen(f"ray start --node-ip-address {node_ip_address} --address {current.parallel.main_ip}:{master_port} --block", shell=True).wait()
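The two launch commands above differ only in a few flags, so they can be built by one helper. The following is a minimal sketch, not the setup from the question: `build_ray_start_cmd` and `start_ray_node` are hypothetical names, and the IP addresses and port are assumed to come from whatever the job environment provides (in the question, `current.parallel.main_ip` and `master_port`). It passes the arguments as a list rather than through `shell=True`, which avoids shell quoting issues.

```python
import subprocess
from typing import List


def build_ray_start_cmd(node_ip: str, head_ip: str, port: int, is_head: bool) -> List[str]:
    """Build the `ray start` argument list for a head or worker node.

    Hypothetical helper: it only uses the CLI flags that already appear
    in the commands above.
    """
    cmd = ["ray", "start", "--node-ip-address", node_ip, "--block"]
    if is_head:
        # The head node listens on the given port.
        cmd += ["--head", "--port", str(port)]
    else:
        # Worker nodes join the head node at head_ip:port.
        cmd += ["--address", f"{head_ip}:{port}"]
    return cmd


def start_ray_node(node_ip: str, head_ip: str, port: int, is_head: bool) -> int:
    """Run `ray start` in the foreground and return its exit code."""
    cmd = build_ray_start_cmd(node_ip, head_ip, port, is_head)
    return subprocess.run(cmd, check=False).returncode
```

On the head node this would be called with `is_head=True` and the same address for `node_ip` and `head_ip`; on each worker, with `is_head=False` and the worker's own address as `node_ip`.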

The head node seems to be working, but the worker nodes are not syncing with the head node.

I get the following output in stderr:

[2023-07-28 09:25:55,500 I 427 427] global_state_accessor.cc:356: This node has an IP address of 10.14.52.21, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.

Any insight on how I can get Ray working on AWS Batch multi-node would be much appreciated!


Solution

  • This seemed to be an issue with pydantic. I downgraded pydantic to version 1.10.12, and after that it worked like a charm.

    Also, on AWS Batch the worker nodes need to be kept alive. The workers therefore need a heartbeat loop: each worker pings the head node to check whether the job is complete and, if it is not, calls time.sleep before checking again.
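The heartbeat described above could look roughly like the sketch below. It rests on an assumption that is not in the original answer: a worker treats the job as complete once the head node's port stops accepting TCP connections for several polls in a row. The function names, the polling interval, and the failure threshold are all hypothetical; a real setup might signal completion some other way.

```python
import socket
import time


def head_is_alive(head_ip: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the head node's port succeeds."""
    try:
        with socket.create_connection((head_ip, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def keep_worker_alive(head_ip: str, port: int,
                      poll_interval: float = 10.0,
                      max_failures: int = 3) -> int:
    """Block while the head node is reachable.

    Assumption: the job is treated as complete after `max_failures`
    consecutive failed pings. Returns the number of polls performed.
    """
    failures = 0
    polls = 0
    while failures < max_failures:
        polls += 1
        if head_is_alive(head_ip, port):
            failures = 0  # head is up, so the job is still running
        else:
            failures += 1
        time.sleep(poll_interval)  # job not known to be finished: keep waiting
    return polls
```

A worker would call `keep_worker_alive(head_ip, port)` after joining the cluster, so that its AWS Batch container does not exit while the head node is still running the job. Requiring several consecutive failures is a design choice to tolerate brief network hiccups.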