Tags: google-cloud-platform, google-kubernetes-engine, endpoint, grpc-python, workload-identity

403 Forbidden on ESPv2, GKE AutoPilot, WIF


I'm following the Getting started with Endpoints for GKE with ESPv2 guide, using Workload Identity Federation on a GKE Autopilot cluster.

I've been running into the error:

F0110 03:46:24.304229 8 server.go:54] fail to initialize config manager: http call to GET https://servicemanagement.googleapis.com/v1/services/name:bookstore.endpoints.<project>.cloud.goog/rollouts?filter=status=SUCCESS returns not 200 OK: 403 Forbidden

This ultimately leads to a transport failure error and shutdown of the Pod.

My first step was to investigate permission issues, but I've been going around in circles and could really use some outside perspective.

Here's my config:

>> gcloud container clusters describe $GKE_CLUSTER_NAME \
--zone=$GKE_CLUSTER_ZONE \
--format='value[delimiter="\n"](nodePools[].config.oauthScopes)'
['https://www.googleapis.com/auth/devstorage.read_only', 
'https://www.googleapis.com/auth/logging.write', 
'https://www.googleapis.com/auth/monitoring', 
'https://www.googleapis.com/auth/service.management.readonly', 
'https://www.googleapis.com/auth/servicecontrol', 
'https://www.googleapis.com/auth/trace.append']

>> gcloud container clusters describe $GKE_CLUSTER_NAME \
--zone=$GKE_CLUSTER_ZONE \
--format='value[delimiter="\n"](nodePools[].config.serviceAccount)'
default
default

Service-Account-Name: test-espv2

Roles

Cloud Trace Agent
Owner
Service Account Token Creator
Service Account User
Service Controller
Workload Identity User

I've associated the Workload Identity service account with the cluster using the following YAML:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    iam.gke.io/gcp-service-account: test-espv2@<project>.iam.gserviceaccount.com
  name: test-espv2
  namespace: eventing
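
The annotation above is only half of the link: the Google service account also needs an IAM binding that lets this Kubernetes service account impersonate it. A sketch of the binding command, assuming the names from the YAML above (`<project>` is a placeholder for the real project id):

```shell
# Workload Identity binding: allow the KSA eventing/test-espv2 to
# impersonate the GSA test-espv2@<project>.iam.gserviceaccount.com.
# <project> is a placeholder; substitute your own project id.
gcloud iam service-accounts add-iam-policy-binding \
  test-espv2@<project>.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:<project>.svc.id.goog[eventing/test-espv2]"
```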

And then I've associated the Pod with the test-espv2 service account:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: esp-grpc-bookstore
  namespace: eventing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: esp-grpc-bookstore
  template:
    metadata:
      labels:
        app: esp-grpc-bookstore
    spec:
      serviceAccountName: test-espv2

Since the gcr.io/endpoints-release/endpoints-runtime:2 image is minimal, I created a test container and deployed it into the same eventing namespace.

Within the container, I'm able to retrieve the endpoint service config with the following command:

curl --fail -o "service.json" -H "Authorization: Bearer $(gcloud auth print-access-token)" \
 "https://servicemanagement.googleapis.com/v1/services/${SERVICE}/configs/${CONFIG_ID}?view=FULL" 

And also within the container, I'm running as the impersonated service account, tested with:

curl -H "Metadata-Flavor: Google" http://169.254.169.254/computeMetadata/v1/instance/service-accounts/
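
As a follow-up check (a sketch, using the names from my setup above): if Workload Identity is wired up correctly, asking the metadata server for the default account's email should return the Google service account, not the string "default".

```shell
# Ask the GKE metadata server which Google identity this Pod resolves to.
# With Workload Identity working, this should print
# test-espv2@<project>.iam.gserviceaccount.com rather than "default".
curl -s -H "Metadata-Flavor: Google" \
  "http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/email"
```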

Are there any other tests I can run to help me debug this issue?

Thanks in advance,


Solution

  • I've finally figured out the issue. It was in two parts.

    1. Redeploying the app, paying special attention to verifying the kubectl annotate serviceaccount commands:
      • running add-iam-policy-binding for both serviceController and cloudtrace.agent
      • omitting nodeSelector: iam.gke.io/gke-metadata-server-enabled: "true", which does not apply on Autopilot

    Doing this produced a successful kube deployment, as shown by the logs.
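
    A sketch of the two role bindings mentioned above, assuming the test-espv2 service account from my config (`<project>` is a placeholder):

```shell
# Grant the ESPv2 service account the Service Controller role,
# needed to read the Endpoints service config and report usage.
gcloud projects add-iam-policy-binding <project> \
  --member "serviceAccount:test-espv2@<project>.iam.gserviceaccount.com" \
  --role roles/servicemanagement.serviceController

# Grant the Cloud Trace Agent role so ESPv2 can write traces.
gcloud projects add-iam-policy-binding <project> \
  --member "serviceAccount:test-espv2@<project>.iam.gserviceaccount.com" \
  --role roles/cloudtrace.agent
```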

    Next error I had was

    <h1>Error: Server Error</h1>
    <h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>
    
    2. This was fixed by turning my attention back to my Kube cluster. Looking through the events on my Ingress service, I found that, since I was on a shared VPC and my security policies only allow firewall management from the host project, the deployment was failing to update the firewall rules.

    Manually provisioning them from the host project, as shown in

    https://cloud.google.com/kubernetes-engine/docs/concepts/ingress#manually_provision_firewall_rules_from_the_host_project

    solved my issues.
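
    For reference, a hedged sketch of the kind of rule I created in the host project. The rule name, network, target tag, and node port are illustrative placeholders; the source ranges are GCP's documented load-balancer health-check ranges:

```shell
# In the shared-VPC *host* project: allow load-balancer health checks
# and traffic to reach the GKE nodes on the service's node port.
# <host-project>, <shared-vpc-network>, <node-tag>, and <node-port>
# are placeholders; take the actual port from the Ingress events.
gcloud compute firewall-rules create allow-gke-ingress-lb \
  --project <host-project> \
  --network <shared-vpc-network> \
  --direction INGRESS \
  --allow tcp:<node-port> \
  --source-ranges 130.211.0.0/22,35.191.0.0/16 \
  --target-tags <node-tag>
```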