Below is a screenshot of the failed DAG, with an incomplete log for the last task, in Google Cloud Composer version composer-1.17.8-airflow-2.1.4 on 24th March 2022.
The next day, 25th March 2022, below is a screenshot of the successful DAG, with a complete log for the last task, on the same Composer version and without any modification to the DAG code or its dependency files.
Could you please let us know the exact reason why Cloud Composer 1 behaved differently on 24th March 2022? On every other day, the same DAG with the same code and dependency files succeeded on its scheduled run at 1:00 AM.
When the failed DAGs were re-run manually, they succeeded. We also kept the DAG, with the same code and dependency files, under observation for the next few days of scheduled runs, and it succeeded each time. We need more clarity on why the DAG failed on 24th March 2022 with an incomplete log for its last task.
Incomplete logs often mean that the Airflow worker pod was evicted. This usually happens when a node in the Kubernetes cluster runs low on memory or disk and raises a pressure condition. If you go to the GKE cluster under Composer's hood (GKE > Workloads > "airflow-worker"), you will probably see that a pod was indeed evicted.
You will probably also see in "Task Instances" that the affected tasks have no worker (Hostname) assigned, which, combined with the incomplete logs, confirms that the pod died.
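If you prefer to check from a script rather than the console, here is a minimal sketch using the Kubernetes Python client to list evicted airflow-worker pods. The namespace name is a placeholder (it differs per Composer environment), and it assumes your kubeconfig already points at the environment's GKE cluster.

```python
# Minimal sketch: list evicted airflow-worker pods via the Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # uses the current kubeconfig context
v1 = client.CoreV1Api()

# Composer 1 runs airflow-worker pods in an environment-specific namespace;
# replace this placeholder with the namespace of your environment.
NAMESPACE = "composer-1-17-8-airflow-2-1-4-xxxxxxxx"

for pod in v1.list_namespaced_pod(NAMESPACE).items:
    if pod.metadata.name.startswith("airflow-worker") and pod.status.reason == "Evicted":
        print(pod.metadata.name, pod.status.message)
```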
Since this normally happens in highly parallelised DAGs, a way to avoid it is to reduce the worker concurrency or use a bigger machine. Composer issues of this kind occur either at the task-instance level or at the network level; our issue was at the task-instance level, which we have now identified.
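For reference, here is a minimal sketch of the DAG-level knobs on Airflow 2.1.x that limit how many task instances run in parallel; the dag_id, schedule and values are illustrative, not taken from the failing DAG. Worker concurrency itself can also be lowered in Composer through the `celery.worker_concurrency` Airflow configuration override.

```python
# Illustrative sketch only: DAG-level limits on parallelism in Airflow 2.1.x.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="example_low_parallelism",  # hypothetical DAG id
    schedule_interval="0 1 * * *",     # the 1:00 AM schedule mentioned in the question
    start_date=datetime(2022, 3, 1),
    catchup=False,
    concurrency=4,       # max task instances of this DAG at once (renamed max_active_tasks in 2.2)
    max_active_runs=1,   # prevent overlapping runs from piling onto the workers
) as dag:
    DummyOperator(task_id="placeholder_task")
```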
High CPU usage is often the root cause of worker pod evictions. If usage is very high, consider scaling out the Composer environment or changing the schedule of your DAG runs.
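If several DAGs all start at 1:00 AM, one way to change the schedule is to stagger them so the workers do not receive every task at the same moment. The sketch below uses made-up dag_ids and a 30-minute offset purely as an illustration.

```python
# Illustrative sketch only: staggering cron schedules so DAGs do not all fire at 01:00.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

common_args = dict(start_date=datetime(2022, 3, 1), catchup=False)

with DAG(dag_id="nightly_load_a", schedule_interval="0 1 * * *", **common_args) as dag_a:
    DummyOperator(task_id="run")

# The second DAG starts 30 minutes later instead of competing for CPU at 01:00.
with DAG(dag_id="nightly_load_b", schedule_interval="30 1 * * *", **common_args) as dag_b:
    DummyOperator(task_id="run")
```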