kubernetes openshift kubernetes-statefulset

StatefulSet pods of a Cassandra cluster are not scheduled to the same node after a restart


I have a Cassandra cluster on 20 worker nodes in Kubernetes, with each pod scheduled on a separate worker node. When multiple pods restart because a threshold limit such as CPU or memory is reached, the pods are not always scheduled back onto the same worker node. To resolve that, I have to force-kill each pod and, once all of them are killed, change the replica count back to normal.

Is there any way to have each pod scheduled to the same node every time? In a StatefulSet, the pod names are fixed.

I tried to set node affinity, but somehow it was not applied because of the StatefulSet. I am using the Deployment type to schedule the StatefulSet.


Solution

  • I'd advise against forcing pods onto the same node every time in Kubernetes. If a node goes down for whatever reason, its pods can no longer be scheduled, because you are forcing them onto a machine that is unavailable.

    Caveats out of the way, there are two ways to do this:

    1. Use a DaemonSet
    2. Use pod anti-affinity and a nodeSelector

    DaemonSet:

    With a DaemonSet, exactly one pod runs on each eligible node. The only downside is that you lose the predictable pod names and the predictable replacement of pods that a StatefulSet gives you.
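
    As a rough illustration of the DaemonSet option, a minimal manifest might look like the sketch below. The name, the app label, the workload node label, and the image tag are all placeholders assumed for this example, not values from the question.

    ```yaml
    # Hypothetical sketch: every name, label, and image here is an assumption.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: cassandra              # assumed name
    spec:
      selector:
        matchLabels:
          app: cassandra           # assumed pod label
      template:
        metadata:
          labels:
            app: cassandra         # must match spec.selector.matchLabels
        spec:
          nodeSelector:
            workload: cassandra    # assumed node label; only labeled nodes get a pod
          containers:
          - name: cassandra
            image: cassandra:4.1   # assumed image tag
    ```

    With the nodeSelector in place, the number of pods follows the number of labeled nodes rather than a replicas field.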

    Scheduling with nodeSelector and podAntiAffinity:

    Define a nodeSelector in your StatefulSet pod template that specifies the desired node labels (e.g., hardware type, storage capacity). This restricts the pods to nodes that carry those labels, but it does not pin a given pod to one specific node. Then add a podAntiAffinity rule of the preferredDuringSchedulingIgnoredDuringExecution type:

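    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                 # required for preferred terms; valid range is 1-100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: <app-name>       # label key carried by the pods
              operator: In
              values:
              - <your-app-name>
          topologyKey: "kubernetes.io/hostname"   # treat each node as its own domain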
    
    

    This encourages the scheduler to place pods that carry the same app name label on different nodes, while still allowing them to share a node if no other node is available.
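
    To show where the two settings sit relative to each other, here is a sketch of a StatefulSet pod template that combines them. The names, the app and workload labels, the headless Service name, and the image tag are assumptions made for illustration; the replica count mirrors the 20 nodes mentioned in the question.

    ```yaml
    # Hypothetical sketch: names, labels, Service, and image are assumptions.
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: cassandra
    spec:
      serviceName: cassandra          # assumed headless Service name
      replicas: 20
      selector:
        matchLabels:
          app: cassandra
      template:
        metadata:
          labels:
            app: cassandra            # label the anti-affinity rule matches on
        spec:
          nodeSelector:
            workload: cassandra       # assumed node label
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchExpressions:
                    - key: app
                      operator: In
                      values:
                      - cassandra
                  topologyKey: "kubernetes.io/hostname"
          containers:
          - name: cassandra
            image: cassandra:4.1      # assumed image tag
    ```

    If one pod per node must be a hard rule rather than a preference, the same term can go under requiredDuringSchedulingIgnoredDuringExecution instead (there the list entries are the affinity terms themselves, with no weight), at the cost of pods staying Pending when no free node exists.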