I am trying to set up high availability (HA) for the Ray head node. Currently, if the Ray head node goes down, any Ray job running in the cluster fails and disappears.
To clarify, I am not using Ray Serve. I am only running some Ray jobs in a Ray cluster.
I deployed my Ray cluster with the KubeRay Helm chart.
Here is my deployment code:
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-hm-ray-cluster
  namespace: production-hm-argo-cd
  labels:
    app.kubernetes.io/name: hm-ray-cluster
spec:
  project: production-hm
  source:
    repoURL: https://ray-project.github.io/kuberay-helm
    # https://github.com/ray-project/kuberay/releases
    targetRevision: 1.3.0
    chart: ray-cluster
    helm:
      releaseName: hm-ray-cluster
      values: |
        # https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml
        ---
        image:
          tag: 2.43.0-py312-cpu
        head:
          serviceAccountName: hm-ray-cluster-service-account
          autoscalerOptions:
            upscalingMode: Default
            # Seconds
            idleTimeoutSeconds: 300
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              cpu: 4000m
              memory: 128Gi
        worker:
          replicas: 10
          minReplicas: 10
          maxReplicas: 100
          serviceAccountName: hm-ray-cluster-service-account
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              cpu: 4000m
              memory: 128Gi
  destination:
    namespace: production-hm-ray-cluster
    server: https://kubernetes.default.svc
  syncPolicy:
    syncOptions:
      - ServerSideApply=true
    automated:
      prune: true
I have read about GCS fault tolerance in KubeRay. I think I need to set gcsFaultToleranceOptions; however, I couldn't find how to set it in the Helm chart.
Assuming I already have a highly available Valkey / Redis cluster, how can I set up the Ray head node in high-availability mode using the Helm chart?
I saw a similar question posted about 4 years ago at https://discuss.ray.io/t/high-availability-for-head-node-of-ray-clusters/2157, but there was no solution at the time.
Any guidance would be appreciated. Thank you!
As confirmed by Ruei An on Ray's Slack, the Helm chart does not support setting the Ray head node in high-availability mode as of today.
I have opened a feature request at https://github.com/ray-project/kuberay-helm/issues/55. If there is any update in the future, I will update this answer.
In the meantime, until the Helm chart supports it, here is my Kubernetes YAML file that enables Global Control Service (GCS) fault tolerance using Valkey (a Redis-compatible store):
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: hm-ray-cluster
  namespace: production-hm-ray-cluster
  labels:
    app.kubernetes.io/name: hm-ray-cluster-deployment
    app.kubernetes.io/part-of: production-hm-ray-cluster
spec:
  rayVersion: 2.43.0
  # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml
  gcsFaultToleranceOptions:
    redisAddress: redis://hm-ray-cluster-valkey-primary.production-hm-ray-cluster-valkey.svc:6379
    redisPassword:
      valueFrom:
        secretKeyRef:
          name: hm-ray-cluster-secret
          key: VALKEY_PASSWORD
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"
    template:
      spec:
        serviceAccountName: hm-ray-cluster-service-account
        # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
        restartPolicy: Never
        containers:
          - name: ray-head
            image: rayproject/ray:2.43.0-py312-cpu
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
            resources:
              requests:
                cpu: 1000m
                memory: 2Gi
              limits:
                cpu: 2000m
                memory: 4Gi
  workerGroupSpecs:
    - groupName: group-1
      replicas: 1
      minReplicas: 1
      maxReplicas: 100
      rayStartParams: {}
      template:
        spec:
          serviceAccountName: hm-ray-cluster-service-account
          # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
          restartPolicy: Never
          containers:
            - name: ray-worker
              image: rayproject/ray:2.43.0-py312-cpu
              resources:
                requests:
                  cpu: 1000m
                  memory: 1Gi
                limits:
                  cpu: 1000m
                  memory: 1Gi
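For reference, the redisPassword above reads from a plain Kubernetes Secret. A minimal sketch of what that Secret could look like (the Secret name and key match the manifest above; the password value is a placeholder, ideally injected by your secret manager instead of being committed like this):

---
apiVersion: v1
kind: Secret
metadata:
  name: hm-ray-cluster-secret
  namespace: production-hm-ray-cluster
type: Opaque
stringData:
  # Placeholder - replace with your Valkey / Redis password
  VALKEY_PASSWORD: change-me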
Note that enabling Global Control Service (GCS) fault tolerance only makes the Ray job history survive a head node restart.
However, I found that if a job is running when the Ray head node dies (for example, after deleting the head node pod so that it restarts automatically), the job still shows "RUNNING", but it is no longer in a healthy state.
So I think this suggestion makes sense:
You should think of a Ray cluster as basically flammable. In production scenarios anything that uses Ray should be wrapped in external retries and durable external stores.
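As one rough sketch of such an external retry (not an official Ray or KubeRay pattern), you could submit the job from a Kubernetes Job with a backoffLimit, so Kubernetes re-submits the Ray job if the submission fails or the cluster dies mid-run. The head service name hm-ray-cluster-head-svc and the trivial entrypoint below are assumptions for illustration only; adjust them to your cluster:

---
apiVersion: batch/v1
kind: Job
metadata:
  name: hm-ray-job-submitter
  namespace: production-hm-ray-cluster
spec:
  # Kubernetes retries the whole submission if `ray job submit` exits non-zero,
  # e.g. when the Ray cluster goes away while the job is running
  backoffLimit: 5
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ray-job-submitter
          image: rayproject/ray:2.43.0-py312-cpu
          command:
            - ray
            - job
            - submit
            # Assumed head service name created by KubeRay; dashboard port 8265 matches the manifest above
            - --address=http://hm-ray-cluster-head-svc.production-hm-ray-cluster.svc:8265
            - --
            - python
            - -c
            - import ray; ray.init(); print(ray.cluster_resources())

The same idea applies if you submit jobs from a workflow orchestrator (for example Argo Workflows) instead; the point is that the retry loop and any important state live outside the Ray cluster.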