google-cloud-platform  airflow  google-kubernetes-engine  google-cloud-composer

Airflow instance in GCP Composer Webserver / Scheduler down


We have Airflow installed through GCP Composer, and all of a sudden the webserver and scheduler went down.

We tried to force a restart by updating a dummy variable or changing the worker nodes, but every attempt fails with the error below:

Error: UPDATE operation on this environment failed 1 hour ago with the following error message:
Operation failed. Couldn't start composer-agent, a GKE job that updates kubernetes resources. Please check if your GKE cluster exists, is healthy and contains non-empty 'default-pool' node pool.

Any suggestions? Our environment is completely stuck.


Solution

  • Based on my understanding, it appears that the composer-agent pod is unable to pull its container images because the DNS records for Private Google Access to *.pkg.dev are missing or incorrect. I believe you only have records for *.gcr.io, since that is where the images were previously hosted.
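
One way to test this hypothesis is to compare how a *.pkg.dev host and a *.gcr.io host resolve from inside the environment's VPC (for example from a VM or pod on the same network). The sketch below is only an illustration: the helper names `classify_ip` and `check_host` are made up for this example, and `us-docker.pkg.dev` is an assumed sample host, not necessarily the one your composer-agent pulls from. It relies on the documented Private Google Access VIP ranges, 199.36.153.8/30 for private.googleapis.com and 199.36.153.4/30 for restricted.googleapis.com.

```python
import ipaddress
import socket

# Documented Private Google Access VIP ranges.
PRIVATE_VIP = ipaddress.ip_network("199.36.153.8/30")     # private.googleapis.com
RESTRICTED_VIP = ipaddress.ip_network("199.36.153.4/30")  # restricted.googleapis.com


def classify_ip(ip):
    """Label an IP as 'private', 'restricted', or 'other' (hypothetical helper)."""
    addr = ipaddress.ip_address(ip)
    if addr in PRIVATE_VIP:
        return "private"
    if addr in RESTRICTED_VIP:
        return "restricted"
    return "other"


def check_host(host, resolve=socket.gethostbyname_ex):
    """Resolve `host` and classify each returned IPv4 address.

    `resolve` is injectable so the function can be exercised without
    network access; by default it uses the system resolver.
    """
    try:
        _, _, ips = resolve(host)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return {"host": host, "error": str(exc), "ips": {}}
    return {"host": host, "error": None, "ips": {ip: classify_ip(ip) for ip in ips}}
```

Calling `check_host("us-docker.pkg.dev")` and `check_host("gcr.io")` from inside the VPC and comparing the results may help: if gcr.io lands on a Private Google Access VIP while the pkg.dev host fails to resolve or returns "other", that would be consistent with the missing *.pkg.dev records described above. An "other" result is not proof of a problem on its own, since environments with a route to the public internet can pull images without these VIPs.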