amazon-web-services multiprocessing distributed-computing ray

Ray cluster launch on AWS with YAML fails due to root permission issue


I am trying to launch a Ray cluster using the YAML file below, but I am getting this error message:

bash: /root/ray_bootstrap_config.yaml: Permission denied

I think it may be due to the permissions required to access my local root folder, from where I launch the cluster. If I go to this folder locally, credentials are required when I click on root.

There is some indication online that I need to do file mounting, but so far I have been unable to do this.

Resource: https://github.com/ray-project/ray/issues/9326

The cluster launches initially, but this error occurs when running the YAML file. It connects to AWS successfully, launching the head and worker nodes, and first installs a few dependencies (e.g. boto) as listed in initialization_commands, but then gets stuck on the error shown.

This is my YAML:

# An unique identifier for the head node and workers of this cluster.
cluster_name: ray-pipeline-test #ray_example_aws

# The maximum number of workers nodes to launch in addition to the head
# node. This takes precedence over min_workers. min_workers default to 0.
max_workers: 1

docker:
    image: "xxxxxxxx1546.dkr.ecr.eu-west-2.amazonaws.com/xxxxx/pipeline:ray-aws" 
 
    container_name: "ray_xxxxxxx_pipeline_aws"      #"ray_nvidia_docker" # e.g. ray_docker
    pull_before_run: True

idle_timeout_minutes: 5



# Cloud-provider specific configuration.
provider:
    type: aws
    region: eu-west-2
    availability_zone: eu-west-2a

initialization_commands:

      #- conda install python==3.6
#      - wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh || true
#      - bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p $HOME/anaconda3 || true
#      - echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
#      - conda create -n py36 python=3.6 anaconda

      #- wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
     # - sh Miniconda3-latest-Linux-x86_64.sh
      - source .bashrc
      - conda update conda -n base
      - conda create -n py36 python=3.6
      - conda activate py36


      - curl -fsSL https://get.docker.com -o get-docker.sh
      - sudo sh get-docker.sh
      - sudo usermod -aG docker $USER
      - sudo systemctl restart docker -f



      - sudo apt-get update
      - sudo apt-get upgrade
      - sudo apt-get install -y python-setuptools
      - sudo apt-get install -y build-essential curl unzip psmisc
      - pip install boto boto3
      - conda install boto boto3
      - pip install awscli
      - sudo pip install --default-timeout=100 future
      - pip install ray==1.0.1.post1
      - aws configure set aws_access_key_id xxxxxxxxxxx
      - aws configure set aws_secret_access_key xxxxxxxxxxxxxxxxxxxxx
      - eval $(aws ecr get-login --no-include-email --region eu-west-2)

auth:
    ssh_user:  ubuntu
    ssh_private_key: /home/user/.ssh/aws_ubuntu_test.pem

head_node:
    InstanceType: c5.2xlarge
    ImageId: ami-xxxxxxxb31fd2c
    KeyName: aws_ubuntu_test

    BlockDeviceMappings:
      - DeviceName: /dev/sda1
        Ebs:
          VolumeSize: 200


worker_nodes:
   InstanceType: c5.2xlarge
   ImageId: ami-xxxxxxxxx31fd2c
   KeyName: aws_ubuntu_test

Solution

  • When using a custom Docker image with Ray, make sure it is based on the rayproject/ray image. Ray's autoscaler has many expectations about what is in the container, which user it runs as, and which settings and optimizations it can change.
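
As a rough illustration of the point above, one way to check whether the custom image is the culprit is to point the docker section of the cluster YAML at a stock Ray image first. This is only a sketch under a few assumptions: rayproject/ray:latest is the image name used in Ray's example configs (ideally pick a tag that matches your local Ray version), the container name is a placeholder, and nothing else in the config depends on the private ECR image.

```yaml
docker:
    # Assumption: temporarily use the stock Ray image instead of the
    # custom ECR image, to see whether the permission error goes away.
    image: "rayproject/ray:latest"
    # Placeholder name, not taken from the original config.
    container_name: "ray_container"
    pull_before_run: True
```

If the cluster comes up with this image, the next thing to try would be rebuilding the custom pipeline image with rayproject/ray as its base (a FROM rayproject/ray line in its Dockerfile), assuming the pipeline's dependencies can be installed on top of it.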