kubernetes google-cloud-platform google-kubernetes-engine google-cloud-endpoints istio

Istio-enabled GKE cluster not reliably communicating with Google Service Infrastructure APIs

I have been unable to get my Istio-enabled Google Kubernetes Engine cluster to connect reliably to Google Cloud Endpoints (the Service Management API) via the Extensible Service Proxy (ESP). When I deploy my Pods, the proxy always fails to start up, which causes the Pod to be restarted, and it outputs the following error:

INFO:Fetching an access token from the metadata service
WARNING:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fea4abece90>: Failed to establish a new connection: [Errno 111] Connection refused',)': /computeMetadata/v1/instance/service-accounts/default/token
ERROR:Failed fetching metadata attribute: http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token

However, after restarting, the proxy reports that everything is fine: it was able to fetch an access token, and I am able to make requests to the Pod successfully:

INFO:Fetching an access token from the metadata service
INFO:Fetching the service config ID from the rollouts service
INFO:Fetching the service configuration from the service management service
INFO:Attribute zone: europe-west2-a
INFO:Attribute project_id: my-project
INFO:Attribute kube_env: KUBE_ENV
nginx: [warn] Using trusted CA certificates file: /etc/nginx/trusted-ca-certificates.crt
10.154.0.5 - - [23/May/2020:21:19:36 +0000] "GET /domains HTTP/1.1" 200 221 "-" "curl/7.58.0"

After about an hour, presumably because the access token has expired, the proxy logs indicate that it was again unable to fetch an access token, and I can no longer make requests to my Pod:

2020/05/23 22:14:04 [error] 9#9: upstream timed out (110: Connection timed out)
2020/05/23 22:14:04 [error] 9#9: Failed to fetch service account token
2020/05/23 22:14:04 [error] 9#9: Fetch access token unexpected status: INTERNAL: Failed to fetch service account token

I have a ServiceEntry resource in place that should allow the proxy to make requests to the metadata server on the GKE node:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: google-metadata-server
spec:
  hosts:
  - metadata.google.internal # GCE metadata server
  addresses:
  - 169.254.169.254 # GCE metadata server
  location: MESH_EXTERNAL
  ports:
  - name: http
    number: 80
    protocol: HTTP
  - name: https
    number: 443
    protocol: HTTPS

I have confirmed that this works by exec-ing into one of the containers and running:

curl -H "Metadata-Flavor: Google" http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token

How can I prevent this behaviour and reliably have the proxy communicate with the Google Service Infrastructure APIs?

Solution

Although I am not entirely convinced this is the solution, it appears that using a dedicated service account to generate access tokens within the Extensible Service Proxy container prevents the behaviour reported above. With it in place, I am able to make requests to the proxy and the upstream service reliably, even after an hour.

The service account I am using has the following roles:

roles/cloudtrace.agent
roles/servicemanagement.serviceController
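
For illustration, here is a minimal sketch of how the ESP container might be pointed at a dedicated service account key instead of the metadata server. It assumes the JSON key for that service account has already been stored in a Kubernetes Secret. The Deployment name, Secret name, service name, backend image, and ports are all hypothetical placeholders, not values from my actual setup. The `--service_account_key` flag is the ESP startup option that makes the proxy read credentials from a file.

```yaml
# Sketch only: names, images, ports, and the Secret are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api                                  # hypothetical
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      volumes:
      - name: service-account-creds
        secret:
          secretName: service-account-creds     # hypothetical Secret holding the JSON key
      containers:
      - name: esp
        image: gcr.io/endpoints-release/endpoints-runtime:1
        args:
        - --http_port=8081
        - --backend=127.0.0.1:8080
        - --service=my-api.endpoints.my-project.cloud.goog   # hypothetical service name
        - --rollout_strategy=managed
        # Read credentials from the mounted key file rather than the metadata server.
        - --service_account_key=/etc/nginx/creds/service-account-creds.json
        ports:
        - containerPort: 8081
        volumeMounts:
        - name: service-account-creds
          mountPath: /etc/nginx/creds
          readOnly: true
      - name: backend
        image: gcr.io/my-project/my-api:latest   # hypothetical backend image
        ports:
        - containerPort: 8080
```

The file name in `--service_account_key` has to match the key name used when the Secret was created, so adjust the path if your Secret stores the key under a different name.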

Assuming this is a stable solution to the problem, I am much happier with this outcome, because I am not 100% comfortable using the metadata server: it relies on the service account associated with the GKE node, and that service account is often more powerful than it needs to be for ESP to do its job.

I will, however, continue to monitor this in case the proxy upstream becomes unreachable again.