I am trying to set up high availability (HA) for the Ray head node. Currently, if the Ray head node goes down, any Ray job running in the cluster fails and disappears.
To clarify, I am not using Ray Serve. I am only running some Ray jobs in a Ray cluster.
I deployed my Ray cluster with the KubeRay Helm chart.
Here is my deployment code:
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-hm-ray-cluster
  namespace: production-hm-argo-cd
  labels:
    app.kubernetes.io/name: hm-ray-cluster
spec:
  project: production-hm
  source:
    repoURL: https://ray-project.github.io/kuberay-helm
    # https://github.com/ray-project/kuberay/releases
    targetRevision: 1.3.0
    chart: ray-cluster
    helm:
      releaseName: hm-ray-cluster
      values: |
        # https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml
        ---
        image:
          tag: 2.43.0-py312-cpu
        head:
          serviceAccountName: hm-ray-cluster-service-account
          autoscalerOptions:
            upscalingMode: Default
            # Seconds
            idleTimeoutSeconds: 300
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              cpu: 4000m
              memory: 128Gi
        worker:
          replicas: 10
          minReplicas: 10
          maxReplicas: 100
          serviceAccountName: hm-ray-cluster-service-account
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              cpu: 4000m
              memory: 128Gi
  destination:
    namespace: production-hm-ray-cluster
    server: https://kubernetes.default.svc
  syncPolicy:
    syncOptions:
      - ServerSideApply=true
    automated:
      prune: true
I have read about GCS fault tolerance in KubeRay. I think I need to set gcsFaultToleranceOptions; however, I couldn't find how to set it in the Helm chart.
Assuming I already have a highly available Valkey / Redis cluster, how can I set up the Ray head node in high-availability mode using the Helm chart?
I saw a similar question posted about 4 years ago at https://discuss.ray.io/t/high-availability-for-head-node-of-ray-clusters/2157, but there was no solution at the time.
Any guidance would be appreciated. Thank you!
As confirmed by Ruei An on Ray's Slack, the Helm chart does not support setting the Ray head node in high-availability mode as of today.
I have opened a feature request at https://github.com/ray-project/kuberay-helm/issues/55. If there is any update in the future, I will update this answer.
In the meantime, until the Helm chart supports it, here is my Kubernetes YAML file that enables Global Control Service (GCS) fault tolerance using Valkey (a Redis-compatible store):
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: hm-ray-cluster
  namespace: production-hm-ray-cluster
  labels:
    app.kubernetes.io/name: hm-ray-cluster-deployment
    app.kubernetes.io/part-of: production-hm-ray-cluster
spec:
  rayVersion: 2.43.0
  # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml
  gcsFaultToleranceOptions:
    redisAddress: redis://hm-ray-cluster-valkey-primary.production-hm-ray-cluster-valkey.svc:6379
    redisPassword:
      valueFrom:
        secretKeyRef:
          name: hm-ray-cluster-secret
          key: VALKEY_PASSWORD
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"
    template:
      spec:
        serviceAccountName: hm-ray-cluster-service-account
        # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
        restartPolicy: Never
        containers:
          - name: ray-head
            image: rayproject/ray:2.43.0-py312-cpu
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
            resources:
              requests:
                cpu: 1000m
                memory: 2Gi
              limits:
                cpu: 2000m
                memory: 4Gi
  workerGroupSpecs:
    - groupName: group-1
      replicas: 1
      minReplicas: 1
      maxReplicas: 100
      rayStartParams: {}
      template:
        spec:
          serviceAccountName: hm-ray-cluster-service-account
          # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
          restartPolicy: Never
          containers:
            - name: ray-worker
              image: rayproject/ray:2.43.0-py312-cpu
              resources:
                requests:
                  cpu: 1000m
                  memory: 1Gi
                limits:
                  cpu: 1000m
                  memory: 1Gi
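For reference, the redisPassword above reads from a plain Kubernetes Secret. A minimal sketch of what that Secret could look like (the Secret name and key match the manifest above; the password value is a placeholder, ideally injected by your secret manager instead of being committed like this):

---
apiVersion: v1
kind: Secret
metadata:
  name: hm-ray-cluster-secret
  namespace: production-hm-ray-cluster
type: Opaque
stringData:
  # Placeholder - replace with your Valkey / Redis password
  VALKEY_PASSWORD: change-me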
Note that enabling Global Control Service (GCS) fault tolerance only makes the Ray job history survive a head node restart.
However, I found that if a job is running when the Ray head node dies (for example, after deleting the head node pod so that it restarts automatically), the job still shows "RUNNING", but it is no longer in a healthy state.
So I think this suggestion makes sense:
You should think of a Ray cluster as basically flammable. In production scenarios anything that uses Ray should be wrapped in external retries and durable external stores.
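As one rough sketch of such an external retry (not an official Ray or KubeRay pattern), you could submit the job from a Kubernetes Job with a backoffLimit, so Kubernetes re-submits the Ray job if the submission fails or the cluster dies mid-run. The head service name hm-ray-cluster-head-svc and the trivial entrypoint below are assumptions for illustration only; adjust them to your cluster:

---
apiVersion: batch/v1
kind: Job
metadata:
  name: hm-ray-job-submitter
  namespace: production-hm-ray-cluster
spec:
  # Kubernetes retries the whole submission if `ray job submit` exits non-zero,
  # e.g. when the Ray cluster goes away while the job is running
  backoffLimit: 5
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ray-job-submitter
          image: rayproject/ray:2.43.0-py312-cpu
          command:
            - ray
            - job
            - submit
            # Assumed head service name created by KubeRay; dashboard port 8265 matches the manifest above
            - --address=http://hm-ray-cluster-head-svc.production-hm-ray-cluster.svc:8265
            - --
            - python
            - -c
            - import ray; ray.init(); print(ray.cluster_resources())

The same idea applies if you submit jobs from a workflow orchestrator (for example Argo Workflows) instead; the point is that the retry loop and any important state live outside the Ray cluster.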