
How to set Ray head node in high availability mode using KubeRay Helm chart?


I am trying to set up high availability (HA) for the Ray head node. Currently, if the Ray head node goes down, the Ray jobs running in this Ray cluster fail and disappear.

To clarify, I am not using Ray Serve. I am only running some Ray jobs in a Ray cluster.

I deployed my Ray cluster using this KubeRay Helm chart.

Here is my deployment code:

---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-hm-ray-cluster
  namespace: production-hm-argo-cd
  labels:
    app.kubernetes.io/name: hm-ray-cluster
spec:
  project: production-hm
  source:
    repoURL: https://ray-project.github.io/kuberay-helm
    # https://github.com/ray-project/kuberay/releases
    targetRevision: 1.3.0
    chart: ray-cluster
    helm:
      releaseName: hm-ray-cluster
      values: |
        # https://github.com/ray-project/kuberay/blob/master/helm-chart/ray-cluster/values.yaml
        ---
        image:
          tag: 2.43.0-py312-cpu
        head:
          serviceAccountName: hm-ray-cluster-service-account
          autoscalerOptions:
            upscalingMode: Default
            # Seconds
            idleTimeoutSeconds: 300
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              cpu: 4000m
              memory: 128Gi
        worker:
          replicas: 10
          minReplicas: 10
          maxReplicas: 100
          serviceAccountName: hm-ray-cluster-service-account
          resources:
            requests:
              cpu: 1000m
              memory: 8Gi
            limits:
              cpu: 4000m
              memory: 128Gi
  destination:
    namespace: production-hm-ray-cluster
    server: https://kubernetes.default.svc
  syncPolicy:
    syncOptions:
      - ServerSideApply=true
    automated:
      prune: true

I have read about GCS fault tolerance in KubeRay. I feel I need to set gcsFaultToleranceOptions; however, I couldn't find a way to set it in the Helm chart.

Assuming I have a highly available Valkey / Redis cluster, how can I set the Ray head node in high availability mode using the Helm chart?

I saw a similar question posted about 4 years ago at https://discuss.ray.io/t/high-availability-for-head-node-of-ray-clusters/2157, but there was no solution at the time.

Any guidance would be appreciated. Thank you!


Solution

  • As confirmed by Ruei An on Ray's Slack, the Helm chart does not support setting the Ray head node in high availability mode as of today.

    I have opened a feature request at https://github.com/ray-project/kuberay-helm/issues/55. If there is any update in the future, I will update this answer.


    In the meantime, until the Helm chart supports it, here is my Kubernetes YAML file that enables Global Control Service (GCS) fault tolerance using Valkey (a Redis-compatible store):

    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: hm-ray-cluster
      namespace: production-hm-ray-cluster
      labels:
        app.kubernetes.io/name: hm-ray-cluster-deployment
        app.kubernetes.io/part-of: production-hm-ray-cluster
    spec:
      rayVersion: 2.43.0
      # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml
      gcsFaultToleranceOptions:
        redisAddress: redis://hm-ray-cluster-valkey-primary.production-hm-ray-cluster-valkey.svc:6379
        redisPassword:
          valueFrom:
            secretKeyRef:
              name: hm-ray-cluster-secret
              key: VALKEY_PASSWORD
      headGroupSpec:
        rayStartParams:
          num-cpus: "0"
        template:
          spec:
            serviceAccountName: hm-ray-cluster-service-account
            # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
            restartPolicy: Never
            containers:
              - name: ray-head
                image: rayproject/ray:2.43.0-py312-cpu
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                resources:
                  requests:
                    cpu: 1000m
                    memory: 2Gi
                  limits:
                    cpu: 2000m
                    memory: 4Gi
      workerGroupSpecs:
        - groupName: group-1
          replicas: 1
          minReplicas: 1
          maxReplicas: 100
          rayStartParams: {}
          template:
            spec:
              serviceAccountName: hm-ray-cluster-service-account
              # https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.autoscaler-v2.yaml
              restartPolicy: Never
              containers:
                - name: ray-worker
                  image: rayproject/ray:2.43.0-py312-cpu
                  resources:
                    requests:
                      cpu: 1000m
                      memory: 1Gi
                    limits:
                      cpu: 1000m
                      memory: 1Gi
    

    Note that enabling Global Control Service (GCS) fault tolerance only makes the Ray job history persist after the head node restarts.
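
    One way to check that behavior is to record the job IDs the cluster knows about, delete the head node pod, wait for it to restart, and list the jobs again. Below is a sketch using Ray's job submission SDK (JobSubmissionClient.list_jobs); the dashboard URL is a placeholder to substitute with your own head service address. With GCS fault tolerance enabled, no job IDs should go missing after the restart.

```python
def missing_jobs(before: set[str], after: set[str]) -> set[str]:
    """Job IDs the cluster knew before the head restart but lost afterwards.

    With GCS fault tolerance enabled, this should be empty.
    """
    return before - after


def known_job_ids(dashboard_url: str) -> set[str]:
    # Query the Ray dashboard for every job submission the cluster remembers.
    # Import is deferred so missing_jobs stays usable without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_url)
    return {job.submission_id for job in client.list_jobs() if job.submission_id}


# Usage sketch (placeholder URL -- substitute your own head service address):
#   before = known_job_ids("http://hm-ray-cluster-head-svc:8265")
#   ... delete the head pod and wait for it to restart ...
#   print(missing_jobs(before, known_job_ids("http://hm-ray-cluster-head-svc:8265")))
```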

    However, I found that if a job is running when the Ray head node dies, then after the head node pod is deleted and automatically restarts, the job still shows as "running", but it is no longer in a healthy state.

    So I think this suggestion makes sense:

    You should think of a Ray cluster as basically flammable. In production scenarios anything that uses Ray should be wrapped in external retries and durable external stores.
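
    Following that advice, one sketch of "external retries" is a plain retry loop wrapped around Ray's job submission SDK, so a job lost to a head node failure is re-submitted from scratch. The dashboard address, entrypoint, and retry counts below are assumptions to replace with your own; JobSubmissionClient and JobStatus come from Ray's job submission SDK.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(fn: Callable[[], T], max_attempts: int = 3, backoff_s: float = 5.0) -> T:
    """Call fn, re-running it from scratch on any exception, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * attempt)  # linear backoff before the next attempt
    raise AssertionError("unreachable")


def submit_and_wait() -> str:
    """Submit one Ray job and block until it finishes; raise if it fails or is stopped."""
    # Import deferred so run_with_retries stays usable without Ray installed.
    from ray.job_submission import JobStatus, JobSubmissionClient

    # Placeholder dashboard address and entrypoint -- substitute your own.
    client = JobSubmissionClient("http://hm-ray-cluster-head-svc:8265")
    job_id = client.submit_job(entrypoint="python my_job.py")
    while True:
        status = client.get_job_status(job_id)
        if status == JobStatus.SUCCEEDED:
            return job_id
        if status in (JobStatus.FAILED, JobStatus.STOPPED):
            raise RuntimeError(f"Ray job {job_id} ended as {status}")
        time.sleep(10)


# Usage: run_with_retries(submit_and_wait, max_attempts=3)
```

    Any durable results the job produces should also be written to an external store (for example object storage or a database), so a re-submitted attempt can pick up or overwrite them cleanly.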