kubernetes pending pod priority


I have the following pods on my kubernetes (1.18.3) cluster:

NAME      READY   STATUS    RESTARTS   AGE
pod1      1/1     Running   0          14m
pod2      1/1     Running   0          14m
pod3      0/1     Pending   0          14m
pod4      0/1     Pending   0          14m

pod3 and pod4 cannot start because the node has capacity for 2 pods only. When pod1 finishes and quits, then the scheduler picks either pod3 or pod4 and starts it. So far so good.

However, I also have a high priority pod (hpod) that I'd like to start before pod3 or pod4 when either of the running pods finishes and quits.

So I created a PriorityClass, as described in the Kubernetes docs:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-no-preemption
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "This priority class should be used for XYZ service pods only."

I've created the following pod yaml:

apiVersion: v1
kind: Pod
metadata:
  name: hpod
  labels:
    app: hpod
spec:
  containers:
  - name: hpod
    image: ...
    resources:
      requests:
        cpu: "500m"
        memory: "500Mi"
      limits:
        cpu: "500m"
        memory: "500Mi"
  priorityClassName: high-priority-no-preemption

Now the problem is that when I start the high-priority pod with kubectl apply -f hpod.yaml, the scheduler terminates a running pod to let the high-priority pod start, even though I've set 'preemptionPolicy: Never'.

The expected behaviour would be to postpone starting hpod until a currently running pod finishes. And when it does, then let hpod start before pod3 or pod4.

What am I doing wrong?


Solution

  • Prerequisites:

    This solution was tested on Kubernetes v1.18.3, Docker 19.03 and Ubuntu 18. A text editor is also required (e.g. sudo apt-get install vim).

    In the Kubernetes documentation, under How to disable preemption, you can find this note:

    Note: In Kubernetes 1.15 and later, if the feature NonPreemptingPriority is enabled, PriorityClasses have the option to set preemptionPolicy: Never. This will prevent pods of that PriorityClass from preempting other pods.

    Also, under Non-preempting PriorityClass, it says:

    The use of the PreemptionPolicy field requires the NonPreemptingPriority feature gate to be enabled.

    If you then check the Feature Gates reference, you will find that NonPreemptingPriority defaults to false, so it is disabled by default.

    Output with your current configuration:

    $ kubectl get pods
    NAME             READY   STATUS    RESTARTS   AGE
    nginx-normal     1/1     Running   0          32s
    nginx-normal-2   1/1     Running   0          32s
    $ kubectl apply -f prio.yaml
    pod/nginx-priority created
    $ kubectl get pods
    NAME             READY   STATUS    RESTARTS   AGE
    nginx-normal-2   1/1     Running   0          48s
    nginx-priority   1/1     Running   0          8s
    

    To enable preemptionPolicy: Never, you need to add --feature-gates=NonPreemptingPriority=true to three files:

    /etc/kubernetes/manifests/kube-apiserver.yaml

    /etc/kubernetes/manifests/kube-controller-manager.yaml

    /etc/kubernetes/manifests/kube-scheduler.yaml

    To check whether the feature gate is enabled, you can use these commands:

    ps aux | grep apiserver | grep feature-gates
    ps aux | grep scheduler | grep feature-gates
    ps aux | grep controller-manager | grep feature-gates
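
The three checks above can also be rolled into one helper. This is only a sketch: `check_gates` is a made-up name, and it assumes each component's binary name appears on its command line followed by a space, as in typical kubeadm-style `ps aux` output.

```shell
# check_gates (hypothetical helper): read a process listing on stdin
# (for example the output of `ps aux`) and report, per control-plane
# component, whether the NonPreemptingPriority gate is on its command line.
check_gates() {
  local listing component
  listing=$(cat)
  for component in kube-apiserver kube-scheduler kube-controller-manager; do
    if printf '%s\n' "$listing" \
        | grep -E -- "(^|[ /])$component " \
        | grep -Eq -- '--feature-gates=[^ ]*NonPreemptingPriority=true'; then
      echo "$component: enabled"
    else
      echo "$component: not enabled"
    fi
  done
}

# Usage on a control-plane node:
#   ps aux | check_gates
```

Because the listing is read completely before any `grep` starts, the helper's own `grep` processes do not show up in it.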
    

    For a fairly detailed explanation of why you have to edit those files, see this GitHub thread.

    $ sudo su
    # cd /etc/kubernetes/manifests/
    # ls
    etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml
    

    Use your text editor to add the feature gate to those files:

    # vi kube-apiserver.yaml
    

    and add - --feature-gates=NonPreemptingPriority=true under spec.containers.command, as in the example below:

    spec:
      containers:
      - command:
        - kube-apiserver
        - --feature-gates=NonPreemptingPriority=true
        - --advertise-address=10.154.0.31
    

    Do the same with the other two files. After that, you can check whether the flags were applied:

    $ ps aux | grep apiserver | grep feature-gates
    root     26713 10.4  5.2 565416 402252 ?       Ssl  14:50   0:17 kube-apiserver --feature-gates=NonPreemptingPriority=true --advertise-address=10.154.0.31 
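
If you would rather not edit the three manifests by hand, the same change could be scripted roughly as follows. Treat it as a sketch under assumptions: `add_gate` is a hypothetical helper, it needs GNU sed (for `\n` in the replacement), and it expects the layout from the example above, where the binary name sits on its own `- kube-...` line. A manifest that already carries some other --feature-gates entry would need a manual merge instead.

```shell
# add_gate FILE COMPONENT (hypothetical helper)
# Insert "- --feature-gates=NonPreemptingPriority=true" directly after the
# "- COMPONENT" line of a static pod manifest, keeping its indentation.
add_gate() {
  local file=$1 component=$2
  local flag='--feature-gates=NonPreemptingPriority=true'
  local tmp

  # Do nothing if the exact flag is already present, so re-running is harmless.
  if grep -qF -- "$flag" "$file"; then
    return 0
  fi

  # Build the new content in a temp file outside the manifests directory,
  # then overwrite the original in place.
  tmp=$(mktemp)
  sed "s|^\( *\)- ${component}\$|&\n\1- ${flag}|" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage (as root on the control-plane node):
#   for c in kube-apiserver kube-controller-manager kube-scheduler; do
#     add_gate "/etc/kubernetes/manifests/$c.yaml" "$c"
#   done
```

Keeping a copy of the original manifests somewhere outside /etc/kubernetes/manifests before running anything like this is a sensible precaution.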
    

    Now you have to redeploy your PriorityClass.

    $ kubectl get priorityclass
    NAME                          VALUE        GLOBAL-DEFAULT   AGE
    high-priority-no-preemption   1000000      false            12m
    system-cluster-critical       2000000000   false            23m
    system-node-critical          2000001000   false            23m
    $ kubectl delete priorityclass high-priority-no-preemption
    priorityclass.scheduling.k8s.io "high-priority-no-preemption" deleted
    $ kubectl apply -f class.yaml 
    priorityclass.scheduling.k8s.io/high-priority-no-preemption created
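
To confirm that the recreated class really stores the field, you can read it back. The sketch below wraps that in a small function; `class_is_non_preempting` is a hypothetical name. The likely reason the class has to be recreated is that the API server drops preemptionPolicy from objects created while the gate was off, but take that as my assumption rather than something verified here.

```shell
# class_is_non_preempting NAME (hypothetical helper)
# Succeeds only if the PriorityClass NAME has preemptionPolicy set to Never.
class_is_non_preempting() {
  local policy
  policy=$(kubectl get priorityclass "$1" -o jsonpath='{.preemptionPolicy}')
  [ "$policy" = "Never" ]
}

# Usage:
#   class_is_non_preempting high-priority-no-preemption \
#     && echo "non-preempting" || echo "still preempting"
```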
    

    The last step is to deploy a pod with this PriorityClass.

    TEST

    $ kubectl get po
    NAME             READY   STATUS    RESTARTS   AGE
    nginx-normal     1/1     Running   0          4m4s
    nginx-normal-2   1/1     Running   0          18m
    $ kubectl apply -f prio.yaml 
    pod/nginx-priority created
    $ kubectl get po
    NAME             READY   STATUS    RESTARTS   AGE
    nginx-normal     1/1     Running   0          5m17s
    nginx-normal-2   1/1     Running   0          20m
    nginx-priority   0/1     Pending   0          67s
    $ kubectl delete po nginx-normal-2
    pod "nginx-normal-2" deleted
    $ kubectl get po
    NAME             READY   STATUS    RESTARTS   AGE
    nginx-normal     1/1     Running   0          5m55s
    nginx-priority   1/1     Running   0          105s