We have a GKE Autopilot cluster. Sometimes our pods simply get terminated, with no explanation. We suspect that k8s is preempting our pods: we only have one DAG running on a daily schedule in this cluster, but it launches a number of tasks simultaneously, and we think that when there aren't enough resources, k8s preempts an existing pod to start another.
Is there a way to test for this? Is there a way to configure GKE/k8s to be a little more patient when waiting for resources?
After some discussion within the team, and also with a Google support engineer, we added some "warm-up" tasks to our DAG. These are simple Python tasks that just sleep for a while (6 minutes seems to be just enough) so that the cluster has time to scale up and start running its own pods. If the scheduler needs to preempt something, it preempts a warm-up task, and that's OK.
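A minimal sketch of what this looks like, assuming Airflow 2.4+ with the TaskFlow API; the task names, the fan-out, and the sleep duration here are illustrative placeholders, not our production DAG:

```python
import time
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_pipeline():
    @task
    def warm_up():
        # Sleep long enough for Autopilot to provision capacity;
        # ~6 minutes was enough in our case. If the scheduler has to
        # preempt a pod during scale-up, it hits this task instead
        # of real work.
        time.sleep(6 * 60)

    @task
    def real_task_a():
        ...  # placeholder for an actual workload

    @task
    def real_task_b():
        ...  # placeholder for an actual workload

    # The real tasks run only after the warm-up completes, so the
    # warm-up pod is the one occupying the cluster while it scales.
    warm_up() >> [real_task_a(), real_task_b()]


daily_pipeline()
```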
Since implementing this, we haven't had any real tasks get preempted.