kubernetes pending pod priority

I have the following pods on my kubernetes (1.18.3) cluster:

NAME      READY   STATUS    RESTARTS   AGE
pod1      1/1     Running   0          14m
pod2      1/1     Running   0          14m
pod3      0/1     Pending   0          14m
pod4      0/1     Pending   0          14m

pod3 and pod4 cannot start because the node has capacity for 2 pods only. When pod1 finishes and quits, then the scheduler picks either pod3 or pod4 and starts it. So far so good.

However, I also have a high priority pod (hpod) that I'd like to start before pod3 or pod4 when either of the running pods finishes and quits.

So I created a PriorityClass, based on the example in the Kubernetes docs:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-no-preemption
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "This priority class should be used for XYZ service pods only."

I've created the following pod yaml:

apiVersion: v1
kind: Pod
metadata:
  name: hpod
  labels:
    app: hpod
spec:
  containers:
  - name: hpod
    image: ...
    resources:
      requests:
        cpu: "500m"
        memory: "500Mi"
      limits:
        cpu: "500m"
        memory: "500Mi"
  priorityClassName: high-priority-no-preemption

Now the problem is that when I start the high-priority pod with kubectl apply -f hpod.yaml, the scheduler terminates a running pod to make room for it, even though I've set 'preemptionPolicy: Never'.

The expected behaviour would be to postpone starting hpod until a currently running pod finishes. And when it does, then let hpod start before pod3 or pod4.

What am I doing wrong?

Solution

Prerequisites:

This solution was tested on Kubernetes v1.18.3, Docker 19.03 and Ubuntu 18. A text editor is also required (e.g. sudo apt-get install vim).

In the Kubernetes documentation, under How to disable preemption, you can find this note:

Note: In Kubernetes 1.15 and later, if the feature NonPreemptingPriority is enabled, PriorityClasses have the option to set preemptionPolicy: Never. This will prevent pods of that PriorityClass from preempting other pods.

Also, under Non-preempting PriorityClass, the documentation states:

The use of the PreemptionPolicy field requires the NonPreemptingPriority feature gate to be enabled.

If you then check the Feature Gates page, you will find that NonPreemptingPriority defaults to false (it is an alpha feature in v1.18), so it is disabled by default.

Output with your current configuration:

$ kubectl get pods
NAME             READY   STATUS    RESTARTS   AGE
nginx-normal     1/1     Running   0          32s
nginx-normal-2   1/1     Running   0          32s
$ kubectl apply -f prio.yaml
pod/nginx-priority created
$ kubectl get pods
NAME             READY   STATUS    RESTARTS   AGE
nginx-normal-2   1/1     Running   0          48s
nginx-priority   1/1     Running   0          8s

To make preemptionPolicy: Never take effect, you need to add --feature-gates=NonPreemptingPriority=true to 3 files:

/etc/kubernetes/manifests/kube-apiserver.yaml

/etc/kubernetes/manifests/kube-controller-manager.yaml

/etc/kubernetes/manifests/kube-scheduler.yaml

To check whether the feature gate is enabled, you can use these commands:

ps aux | grep apiserver | grep feature-gates
ps aux | grep scheduler | grep feature-gates
ps aux | grep controller-manager | grep feature-gates
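
The ps checks above only show flags of processes that are already running. As a complementary sketch, the manifests themselves can be inspected. check_gate_in_manifests is a hypothetical helper name, and the default directory assumes the kubeadm-style layout listed above.

```shell
# Hypothetical helper: report, for each control-plane static pod manifest,
# whether the NonPreemptingPriority gate is set to true.
# The directory argument is optional and defaults to the kubeadm location.
check_gate_in_manifests() {
  manifest_dir=${1:-/etc/kubernetes/manifests}
  for component in kube-apiserver kube-controller-manager kube-scheduler; do
    file="$manifest_dir/$component.yaml"
    # A missing or unreadable file is reported the same way as a missing flag.
    if grep -q -- 'NonPreemptingPriority=true' "$file" 2>/dev/null; then
      echo "$component: enabled"
    else
      echo "$component: missing"
    fi
  done
}
```

Calling check_gate_in_manifests with no argument (as root, if the files are not world-readable) prints one line per component.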

For a detailed explanation of why you have to edit those files, please check this GitHub thread.

$ sudo su
# cd /etc/kubernetes/manifests/
# ls
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

Use your text editor to add the feature gate to those files:

# vi kube-apiserver.yaml

and add - --feature-gates=NonPreemptingPriority=true under spec.containers.command, as in the example below:

spec:
  containers:
  - command:
    - kube-apiserver
    - --feature-gates=NonPreemptingPriority=true
    - --advertise-address=10.154.0.31

Then do the same with the 2 other files. The kubelet restarts these static pods automatically when their manifests change. After that, you can check whether the flags were applied:

$ ps aux | grep apiserver | grep feature-gates
root     26713 10.4  5.2 565416 402252 ?       Ssl  14:50   0:17 kube-apiserver --feature-gates=NonPreemptingPriority=true --advertise-address=10.154.0.31
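
The manual edit described above could also be scripted. The following is only a sketch under assumptions: add_gate is a hypothetical helper name, it assumes GNU sed, and it assumes the manifest names the binary on its own line (such as "    - kube-apiserver"), as in the example above.

```shell
# Hypothetical helper: insert the feature-gate flag right after the line
# that names the binary, reusing that line's indentation.
# Assumes GNU sed (for -i and for \n in the replacement text).
add_gate() {
  file=$1
  binary=$2
  flag='--feature-gates=NonPreemptingPriority=true'
  # If a --feature-gates flag already exists, do not add a second one;
  # the existing value has to be merged by hand.
  if grep -q -- '--feature-gates=' "$file"; then
    echo "$file: --feature-gates already present, edit it manually" >&2
    return 1
  fi
  # \( *\) captures the indent, & re-emits the matched line,
  # then a new line with the same indent and the flag is appended.
  sed -i "s|^\( *\)- ${binary}\$|&\n\1- ${flag}|" "$file"
}

# Usage (as root), one call per manifest:
#   add_gate /etc/kubernetes/manifests/kube-apiserver.yaml kube-apiserver
#   add_gate /etc/kubernetes/manifests/kube-controller-manager.yaml kube-controller-manager
#   add_gate /etc/kubernetes/manifests/kube-scheduler.yaml kube-scheduler
```

If you want backups of the original manifests, it is safer to keep them outside /etc/kubernetes/manifests, since the kubelet may treat any file in that directory as a manifest.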

Now you have to redeploy your PriorityClass.

$ kubectl get priorityclass
NAME                          VALUE        GLOBAL-DEFAULT   AGE
high-priority-no-preemption   1000000      false            12m
system-cluster-critical       2000000000   false            23m
system-node-critical          2000001000   false            23m
$ kubectl delete priorityclass high-priority-no-preemption
priorityclass.scheduling.k8s.io "high-priority-no-preemption" deleted
$ kubectl apply -f class.yaml 
priorityclass.scheduling.k8s.io/high-priority-no-preemption created
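
One way to confirm that the recreated PriorityClass kept its policy is to read the field back from the API server. This is a minimal sketch: stored_preemption_policy is a hypothetical helper name, and the expectation that the field comes back empty or different from Never when the gate is off is an assumption based on the documentation quoted above, not something shown in the outputs here.

```shell
# Hypothetical helper: print the preemptionPolicy that the API server
# stored for the PriorityClass named in $1.
stored_preemption_policy() {
  kubectl get priorityclass "$1" -o jsonpath='{.preemptionPolicy}'
}
```

Running stored_preemption_policy high-priority-no-preemption should print Never once the gate is active and the class has been recreated.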

The last step is to deploy a pod with this PriorityClass.

TEST

$ kubectl get po
NAME             READY   STATUS    RESTARTS   AGE
nginx-normal     1/1     Running   0          4m4s
nginx-normal-2   1/1     Running   0          18m
$ kubectl apply -f prio.yaml 
pod/nginx-priority created
$ kubectl get po
NAME             READY   STATUS    RESTARTS   AGE
nginx-normal     1/1     Running   0          5m17s
nginx-normal-2   1/1     Running   0          20m
nginx-priority   0/1     Pending   0          67s
$ kubectl delete po nginx-normal-2
pod "nginx-normal-2" deleted
$ kubectl get po
NAME             READY   STATUS    RESTARTS   AGE
nginx-normal     1/1     Running   0          5m55s
nginx-priority   1/1     Running   0          105s
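
To see the priority and preemption policy that each pod actually received, the relevant Pod spec fields can be listed next to the pod status. A minimal sketch; pod_priorities is a hypothetical helper name.

```shell
# Hypothetical helper: list pods in the current namespace with their
# resolved priority value, preemption policy and phase.
pod_priorities() {
  kubectl get pods -o custom-columns='NAME:.metadata.name,PRIORITY:.spec.priority,POLICY:.spec.preemptionPolicy,STATUS:.status.phase'
}
```

With the setup above, the pod created from prio.yaml would be expected to show the value of the PriorityClass (1000000) in the PRIORITY column.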