Tags: kubernetes, google-kubernetes-engine, daemonset

Kubernetes is failing to schedule DaemonSet pods on nodes in an auto-scaling GKE node pool


We are seeing an issue with the GKE Kubernetes scheduler being unable or unwilling to schedule DaemonSet pods on nodes in an auto-scaling node pool.

We have three node pools in the cluster; the pool-x pool, however, is used exclusively to schedule a single Deployment backed by an HPA, and the nodes in this pool have the taint "node-use=pool-x:NoSchedule" applied to them. We have also deployed a filebeat DaemonSet with a very lenient tolerations spec of operator: Exists (hopefully this is correct) to ensure the DaemonSet is scheduled on every node in the cluster; a sketch of this is shown below.
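
For illustration, a minimal sketch of a DaemonSet spec along these lines (the taint key and the operator: Exists toleration are as described above; the name, namespace, labels, and image are placeholders):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat            # placeholder name
  namespace: kube-system    # placeholder namespace
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      tolerations:
        # operator: Exists with no key matches every taint,
        # including node-use=pool-x:NoSchedule on the pool-x nodes.
        - operator: Exists
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:7.6.0   # placeholder image/tag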

Our assumption was that, as pool-x is auto-scaled up, the filebeat DaemonSet pod would be scheduled on each new node prior to any of the pods assigned to that node. However, we are noticing that as new nodes are added to the pool, the filebeat pods fail to be placed on them and remain in a perpetual "Pending" state. Here is an example of the (truncated) describe output of the filebeat DaemonSet:

Number of Nodes Scheduled with Up-to-date Pods: 108
Number of Nodes Scheduled with Available Pods: 103
Number of Nodes Misscheduled: 0
Pods Status:  103 Running / 5 Waiting / 0 Succeeded / 0 Failed

And the events for one of the "Pending" filebeat pods:

Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Warning  FailedScheduling   18m (x96 over 68m)      default-scheduler   0/106 nodes are available: 105 node(s) didn't match node selector, 5 Insufficient cpu.
  Normal   NotTriggerScaleUp  3m56s (x594 over 119m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 6 node(s) didn't match node selector
  Warning  FailedScheduling   3m14s (x23 over 15m)    default-scheduler   0/108 nodes are available: 107 node(s) didn't match node selector, 5 Insufficient cpu.

As you can see, the node does not have enough resources to schedule the filebeat pod: its CPU requests cannot be satisfied because of the other pods already running on the node. However, why is the DaemonSet pod not placed on the node prior to any other pods being scheduled? It seems like the very definition of a DaemonSet necessitates priority scheduling.

Also of note: if I delete a pod on a node where filebeat is stuck "Pending" due to unsatisfiable CPU requests, filebeat is immediately scheduled on that node, indicating that some scheduling precedence is being observed.

Ultimately, we just want to ensure the filebeat DaemonSet is able to schedule a pod on every single node in the cluster, and have that priority work nicely with our cluster autoscaling and HPAs. Any ideas on how we can achieve this?

We'd like to avoid having to use Pod Priority, as it's apparently an alpha feature in GKE and we are unable to make use of it at this time.


Solution

  • The behavior you are expecting, with DaemonSet pods being scheduled on a node first, is no longer the reality as of Kubernetes 1.12. Since 1.12, DaemonSet pods are handled by the default scheduler and rely on pod priority to determine the order in which pods are scheduled. You may want to consider creating a PriorityClass specific to DaemonSets with a relatively high value, to ensure they are scheduled ahead of most of your other pods; a minimal sketch of this is shown below.
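
A minimal sketch of such a PriorityClass, assuming a cluster version where the resource is available (scheduling.k8s.io/v1 from Kubernetes 1.14; older clusters expose it under the v1beta1 API group); the class name and value here are placeholders:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-priority   # placeholder name
value: 1000000               # high relative to workload pods; user-defined values max out at 1000000000
globalDefault: false
description: "Ensures DaemonSet pods are scheduled ahead of ordinary workload pods."

The DaemonSet then references the class in its pod template:

# In the filebeat DaemonSet:
spec:
  template:
    spec:
      priorityClassName: daemonset-priority   # placeholder name from above

With pod preemption enabled (the default once pod priority is in effect), a pending higher-priority DaemonSet pod can also preempt lower-priority pods already running on the node, which lines up with the behavior you observed when deleting a pod manually.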