kubernetes google-kubernetes-engine autopilot

With GKE Autopilot banning the cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation, is there a way to ensure job pods do not get evicted?

Our GKE Autopilot cluster was recently upgraded to version 1.21.6-gke.1503, which apparently bans the cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation.

I totally get this for deployments, as Google doesn't want a deployment preventing scale-down, but for jobs I'd argue this annotation makes perfect sense in certain cases. We run complex jobs that themselves start and monitor other jobs, and the sheer number of moving parts makes it hard to make them restart-resistant.

Is there any way to make it as unlikely as possible for job pods to be restarted/moved around when using Autopilot? Prior to switching to Autopilot, we used to make sure our jobs filled a single node by requesting all of its available resources; combined with a Guaranteed QoS class, this made sure the only way for a pod to be evicted was if the node somehow failed, which almost never happened. Now all we seem to have left is the Guaranteed QoS class, but that doesn't prevent pods from being evicted.

Solution

This is now supported in GKE Autopilot, starting with version 1.27.

Setting the cluster-autoscaler.kubernetes.io/safe-to-evict=false annotation on a Pod prevents GKE-initiated disruption of that Pod for up to 7 days, including disruption caused by autoscaling and by updates.
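
As a rough sketch of how this could look for a Job, the annotation goes on the Pod template rather than on the Job object itself, since it is the Pod that gets protected. The Job name, image, command, and resource figures below are hypothetical placeholders, not values from the question; setting requests equal to limits is only there to keep the Guaranteed QoS class mentioned in the question.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: orchestrator-job            # hypothetical name
spec:
  backoffLimit: 0                   # assumption: do not retry automatically
  template:
    metadata:
      annotations:
        # Must be the string "false", hence the quotes.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: orchestrator        # hypothetical container name
          image: example.com/orchestrator:latest   # hypothetical image
          command: ["/bin/run-orchestration"]      # hypothetical command
          resources:
            # requests == limits gives the Pod the Guaranteed QoS class
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "2"
              memory: 4Gi
```

Note that this only covers disruption initiated by GKE; a Pod can still be lost if its node fails, so a job expected to run longer than the protection window should still be able to tolerate a restart.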