Running Pods with Workload Identity raises a Google credential error when autoscaling starts.
My application is configured with Workload Identity to use Google Pub/Sub, and a HorizontalPodAutoscaler is set to scale the pods up to 5 replicas.
The problem arises when the autoscaler creates new replicas of the pod: GKE's metadata server is unreachable for a few seconds, and after 5 to 10 seconds the errors stop.
Here is the error log right after a pod is created by the autoscaler:
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
WARNING:google.auth._default:Authentication failed using Compute Engine authentication due to unavailable metadata server
Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
What exactly is the problem here?
When I read the Workload Identity docs, I found this:
"The GKE metadata server takes a few seconds to start to run on a newly created Pod"
I think the problem is related to this behavior, but is there a solution for this kind of situation?
Thanks
There is no specific solution other than to ensure your application can cope with this. Kubernetes uses DaemonSets to launch per-node apps like the metadata-intercepting server, but as the docs note, that takes a few seconds (noticing the new node, scheduling the Pod, pulling the image, starting the container).
You can use an initContainer to prevent your application from starting until a readiness check succeeds, which can just try to hit a GCP API (or the metadata server) until it works, as in the sketch below. But that's probably more work than just making your code retry when those errors happen.
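For the initContainer route, a minimal sketch could look something like this. The container name, image, and sleep interval are assumptions on my part, not anything from your setup; the idea is just to block the main container until the GKE metadata server can hand out a token for the Pod's service account:

```yaml
# Hypothetical initContainer fragment for the Pod spec; names and image are placeholders.
initContainers:
- name: wait-for-gke-metadata-server
  image: curlimages/curl:latest
  command:
  - sh
  - -c
  - |
    # Loop until the metadata server responds to a token request for the default service account.
    until curl -s -H "Metadata-Flavor: Google" \
      "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" \
      > /dev/null; do
      echo "waiting for GKE metadata server..."
      sleep 1
    done
```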
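If you prefer retrying in the application instead, here is a minimal Python sketch assuming google-auth is what produced the log above. The function name, attempt count, and delay are arbitrary choices, not a prescribed pattern:

```python
import time

import google.auth
from google.auth.exceptions import DefaultCredentialsError


def default_credentials_with_retry(max_attempts=10, delay_seconds=2):
    """Retry google.auth.default() until the GKE metadata server is reachable."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Returns (credentials, project_id) once the metadata server answers.
            return google.auth.default()
        except DefaultCredentialsError:
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)


credentials, project_id = default_credentials_with_retry()
```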