I have an application running on my local machine that uses React -> gRPC-Web -> Envoy -> a Go app, and everything runs with no problems. I'm trying to deploy this using GKE Autopilot, but I just haven't been able to get the configuration right. I'm new to all of GCP/GKE, so I'm looking for help figuring out where I'm going wrong.
I was following this doc initially, even though I only have one gRPC service: https://cloud.google.com/architecture/exposing-grpc-services-on-gke-using-envoy-proxy
From what I've read, GKE Autopilot mode requires using external HTTP(S) Load Balancing instead of the Network Load Balancing described in the solution above, so that's what I've been trying to get working. After a variety of attempts, my current strategy has an Ingress, a BackendConfig, a Service, and a Deployment. The Deployment has three containers: my app, an Envoy sidecar to translate the gRPC-Web requests and responses, and a Cloud SQL Proxy sidecar. I eventually want to use TLS, but for now I've left it out so it doesn't complicate things even more.
When I apply all of the configs, the backend service shows one backend in one zone, and the health check fails. The health check is set to port 8080 and path /healthz, which is what I think I've specified in the BackendConfig and the Deployment, but I'm suspicious because, when I look at the details for the envoy-sidecar container, it shows the readiness probe as: http-get HTTP://:0/healthz headers=x-envoy-livenessprobe:healthz. Does ":0" just mean it's using the default address and port for the container, or does it indicate a config problem?
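For comparison, here is the same readiness probe written against the numeric container port instead of the named port, in case the named-port form is what's producing the ":0" in that output. This is only a variation to help rule things out, not something I've confirmed changes the behavior:

# envoy-sidecar readiness probe using an explicit port number
# instead of the named port "http"
readinessProbe:
  httpGet:
    port: 8080
    path: /healthz
    httpHeaders:
    - name: x-envoy-livenessprobe
      value: healthz
    scheme: HTTP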
I've been reading various docs and just haven't been able to piece it all together. Is there an example somewhere that shows how this can be done? I've been searching and haven't found one.
My current configs are:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: grammar-games-ingress
#annotations:
# If the class annotation is not specified it defaults to "gce".
# kubernetes.io/ingress.class: "gce"
# kubernetes.io/ingress.global-static-ip-name: <IP addr>
spec:
defaultBackend:
service:
name: grammar-games-core
port:
number: 80
---
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
name: grammar-games-bec
annotations:
cloud.google.com/neg: '{"ingress": true}'
spec:
sessionAffinity:
affinityType: "CLIENT_IP"
healthCheck:
checkIntervalSec: 15
port: 8080
type: HTTP
requestPath: /healthz
timeoutSec: 60
---
apiVersion: v1
kind: Service
metadata:
name: grammar-games-core
annotations:
cloud.google.com/neg: '{"ingress": true}'
cloud.google.com/app-protocols: '{"http":"HTTP"}'
cloud.google.com/backend-config: '{"default": "grammar-games-bec"}'
spec:
type: ClusterIP
selector:
app: grammar-games-core
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: grammar-games-core
spec:
# Two replicas for right now, just so I can see how RPC calls get directed.
# replicas: 2
selector:
matchLabels:
app: grammar-games-core
template:
metadata:
labels:
app: grammar-games-core
spec:
serviceAccountName: grammar-games-core-k8sa
containers:
- name: grammar-games-core
image: gcr.io/grammar-games/grammar-games-core:1.1.2
command:
- "/bin/grammar-games-core"
ports:
- containerPort: 52001
env:
- name: GAMESDB_USER
valueFrom:
secretKeyRef:
name: gamesdb-config
key: username
- name: GAMESDB_PASSWORD
valueFrom:
secretKeyRef:
name: gamesdb-config
key: password
- name: GAMESDB_DB_NAME
valueFrom:
secretKeyRef:
name: gamesdb-config
key: db-name
- name: GRPC_SERVER_PORT
value: '52001'
- name: GAMES_LOG_FILE_PATH
value: ''
- name: GAMESDB_LOG_LEVEL
value: 'debug'
resources:
requests:
            # Memory request for the app container. Adjust this value based
            # on your application's requirements.
memory: "2Gi"
            # CPU request for the app container. Adjust this value based
            # on your application's requirements.
cpu: "1"
readinessProbe:
exec:
command: ["/bin/grpc_health_probe", "-addr=:52001"]
initialDelaySeconds: 5
- name: cloud-sql-proxy
# It is recommended to use the latest version of the Cloud SQL proxy
# Make sure to update on a regular schedule!
image: gcr.io/cloudsql-docker/gce-proxy:1.24.0
command:
- "/cloud_sql_proxy"
# If connecting from a VPC-native GKE cluster, you can use the
# following flag to have the proxy connect over private IP
# - "-ip_address_types=PRIVATE"
# Replace DB_PORT with the port the proxy should listen on
# Defaults: MySQL: 3306, Postgres: 5432, SQLServer: 1433
- "-instances=grammar-games:us-east1:grammar-games-db=tcp:3306"
securityContext:
# The default Cloud SQL proxy image runs as the
# "nonroot" user and group (uid: 65532) by default.
runAsNonRoot: true
# Resource configuration depends on an application's requirements. You
# should adjust the following values based on what your application
# needs. For details, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
resources:
requests:
# The proxy's memory use scales linearly with the number of active
# connections. Fewer open connections will use less memory. Adjust
# this value based on your application's requirements.
memory: "2Gi"
# The proxy's CPU use scales linearly with the amount of IO between
# the database and the application. Adjust this value based on your
# application's requirements.
cpu: "1"
- name: envoy-sidecar
image: envoyproxy/envoy:v1.20-latest
ports:
- name: http
containerPort: 8080
resources:
requests:
cpu: 10m
ephemeral-storage: 256Mi
memory: 256Mi
volumeMounts:
- name: config
mountPath: /etc/envoy
readinessProbe:
httpGet:
port: http
httpHeaders:
- name: x-envoy-livenessprobe
value: healthz
path: /healthz
scheme: HTTP
volumes:
- name: config
configMap:
name: envoy-sidecar-conf
---
apiVersion: v1
kind: ConfigMap
metadata:
name: envoy-sidecar-conf
data:
envoy.yaml: |
static_resources:
listeners:
- name: listener_0
address:
socket_address:
address: 0.0.0.0
port_value: 8080
filter_chains:
- filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
access_log:
- name: envoy.access_loggers.stdout
typed_config:
"@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
codec_type: AUTO
stat_prefix: ingress_http
route_config:
name: local_route
virtual_hosts:
- name: http
domains:
- "*"
routes:
- match:
prefix: "/grammar_games_protos.GrammarGames/"
route:
cluster: grammar-games-core-grpc
cors:
allow_origin_string_match:
- prefix: "*"
allow_methods: GET, PUT, DELETE, POST, OPTIONS
allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout
max_age: "1728000"
expose_headers: custom-header-1,grpc-status,grpc-message
http_filters:
- name: envoy.filters.http.health_check
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
pass_through_mode: false
headers:
- name: ":path"
exact_match: "/healthz"
- name: "x-envoy-livenessprobe"
exact_match: "healthz"
- name: envoy.filters.http.grpc_web
- name: envoy.filters.http.cors
- name: envoy.filters.http.router
typed_config: {}
clusters:
- name: grammar-games-core-grpc
connect_timeout: 0.5s
type: logical_dns
lb_policy: ROUND_ROBIN
http2_protocol_options: {}
load_assignment:
cluster_name: grammar-games-core-grpc
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: 0.0.0.0
port_value: 52001
health_checks:
timeout: 1s
interval: 10s
unhealthy_threshold: 2
healthy_threshold: 2
grpc_health_check: {}
admin:
access_log_path: /dev/stdout
address:
socket_address:
address: 127.0.0.1
port_value: 8090
I've finally gotten through this issue, so I wanted to post the answer I have for reference.
It turns out that the solution in this document works:
https://cloud.google.com/architecture/exposing-grpc-services-on-gke-using-envoy-proxy#introduction
Somewhere in one of the docs about GKE Autopilot mode, I got the impression that you can't use a Network Load Balancer and instead need to use Ingress for HTTP(S) Load Balancing. That's why I was pursuing the other approach. Even after working with Google support for several weeks, the configs all looked correct, but the health check from the load balancer just would not pass. That's when we figured out that the solution with the Network Load Balancer actually does work.
I also had some issues getting HTTPS/TLS configured; that turned out to be a problem in my Envoy config file.
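For reference, the envoy-certs secret that the Envoy Deployment below mounts at /etc/ssl/envoy is just a standard Kubernetes TLS secret, so its tls.crt and tls.key entries line up with the filenames in the Envoy config. A minimal sketch, assuming you already have a PEM certificate chain and private key (the actual values are placeholders here):

apiVersion: v1
kind: Secret
metadata:
  name: envoy-certs
type: kubernetes.io/tls
data:
  # base64-encoded PEM data; placeholders shown
  tls.crt: <base64-encoded certificate chain>
  tls.key: <base64-encoded private key>

The same secret can also be created directly with kubectl create secret tls envoy-certs --cert=server.crt --key=server.key (the file names are just examples).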
I still have one remaining issue with the stability of the pods, but that's a separate issue that I'll pursue in a different post/question. As long as I only ask for 1 replica, the solution is stable and working well; Autopilot takes care of provisioning the underlying nodes, and scaling the number of pod replicas beyond that is something I'd still need to handle myself (for example with a HorizontalPodAutoscaler).
I know the config for all of this can be very tricky, so I'm including it all here for reference (just using my-app as a placeholder). Hopefully it will help someone else get there faster than I did! I think it's a great solution for gRPC-Web once you get it working. You'll also notice that I'm using the cloud-sql-proxy sidecar to talk to Cloud SQL and a GKE service account for authentication.
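One piece not shown below is the service account itself. In a typical Workload Identity setup (which is what I'm sketching here; the Google service account and project names are placeholders), my-app-k8sa would look roughly like this:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-k8sa
  annotations:
    # Bind this Kubernetes service account to a Google service account
    # that has the Cloud SQL Client role (placeholder names).
    iam.gke.io/gcp-service-account: my-app-gsa@my-project.iam.gserviceaccount.com

The Google service account also needs a roles/iam.workloadIdentityUser binding for this Kubernetes service account, and roles/cloudsql.client so the proxy can connect.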
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 1
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
serviceAccountName: my-app-k8sa
terminationGracePeriodSeconds: 30
containers:
- name: my-app
image: gcr.io/my-project/my-app:1.1.0
command:
- "/bin/my-app"
ports:
- containerPort: 52001
env:
- name: DB_USER
valueFrom:
secretKeyRef:
name: db-config
key: username
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: db-config
key: password
- name: DB_NAME
valueFrom:
secretKeyRef:
name: db-config
key: db-name
- name: GRPC_SERVER_PORT
value: '52001'
readinessProbe:
exec:
command: ["/bin/grpc_health_probe", "-addr=:52001"]
initialDelaySeconds: 10
livenessProbe:
exec:
command: ["/bin/grpc_health_probe", "-addr=:52001"]
initialDelaySeconds: 15
- name: cloud-sql-proxy
# It is recommended to use the latest version of the Cloud SQL proxy
# Make sure to update on a regular schedule!
image: gcr.io/cloudsql-docker/gce-proxy:1.27.1
command:
- "/cloud_sql_proxy"
# If connecting from a VPC-native GKE cluster, you can use the
# following flag to have the proxy connect over private IP
# - "-ip_address_types=PRIVATE"
# Replace DB_PORT with the port the proxy should listen on
# Defaults: MySQL: 3306, Postgres: 5432, SQLServer: 1433
- "-instances=my-project:us-east1:my-app-db=tcp:3306"
securityContext:
# The default Cloud SQL proxy image runs as the
# "nonroot" user and group (uid: 65532) by default.
runAsNonRoot: true
---
apiVersion: v1
kind: Service
metadata:
name: my-app
spec:
type: ClusterIP
selector:
app: my-app
ports:
- name: my-app-port
protocol: TCP
port: 52001
clusterIP: None
---
apiVersion: v1
kind: Service
metadata:
name: envoy
spec:
type: LoadBalancer
selector:
app: envoy
ports:
- name: https
protocol: TCP
port: 443
targetPort: 8443
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: envoy
spec:
replicas: 1
selector:
matchLabels:
app: envoy
template:
metadata:
labels:
app: envoy
spec:
containers:
- name: envoy
image: envoyproxy/envoy:v1.20-latest
ports:
- name: https
containerPort: 8443
resources:
requests:
cpu: 10m
ephemeral-storage: 256Mi
memory: 256Mi
volumeMounts:
- name: config
mountPath: /etc/envoy
- name: certs
mountPath: /etc/ssl/envoy
readinessProbe:
httpGet:
port: https
httpHeaders:
- name: x-envoy-livenessprobe
value: healthz
path: /healthz
scheme: HTTPS
volumes:
- name: config
configMap:
name: envoy-conf
- name: certs
secret:
secretName: envoy-certs
---
apiVersion: v1
kind: ConfigMap
metadata:
name: envoy-conf
data:
envoy.yaml: |
static_resources:
listeners:
- name: listener_0
address:
socket_address:
address: 0.0.0.0
port_value: 8443
filter_chains:
- filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
access_log:
- name: envoy.access_loggers.stdout
typed_config:
"@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
codec_type: AUTO
stat_prefix: ingress_https
route_config:
name: local_route
virtual_hosts:
- name: https
domains:
- "*"
routes:
- match:
prefix: "/my_app_protos.MyService/"
route:
cluster: my-app-cluster
cors:
allow_origin_string_match:
- prefix: "*"
allow_methods: GET, PUT, DELETE, POST, OPTIONS
allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout
max_age: "1728000"
expose_headers: custom-header-1,grpc-status,grpc-message
http_filters:
- name: envoy.filters.http.health_check
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
pass_through_mode: false
headers:
- name: ":path"
exact_match: "/healthz"
- name: "x-envoy-livenessprobe"
exact_match: "healthz"
- name: envoy.filters.http.grpc_web
- name: envoy.filters.http.cors
- name: envoy.filters.http.router
typed_config: {}
transport_socket:
name: envoy.transport_sockets.tls
typed_config:
"@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
require_client_certificate: false
common_tls_context:
tls_certificates:
- certificate_chain:
filename: /etc/ssl/envoy/tls.crt
private_key:
filename: /etc/ssl/envoy/tls.key
clusters:
- name: my-app-cluster
connect_timeout: 0.5s
type: STRICT_DNS
dns_lookup_family: V4_ONLY
lb_policy: ROUND_ROBIN
http2_protocol_options: {}
load_assignment:
cluster_name: my-app-cluster
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: my-app.default.svc.cluster.local
port_value: 52001
health_checks:
timeout: 1s
interval: 10s
unhealthy_threshold: 2
healthy_threshold: 2
grpc_health_check: {}
admin:
access_log_path: /dev/stdout
address:
socket_address:
address: 127.0.0.1
port_value: 8090
I'm still not sure about the best way to specify resource requirements for the containers in these Deployments or the number of replicas, but the solution is working.
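For what it's worth, one illustrative starting point (these numbers are not values I've tuned, and Autopilot may adjust requests upward to meet its own minimums) is to give each container in the my-app Deployment an explicit request:

# Only the resources stanzas are shown; the rest of each container spec
# stays the same as above.
containers:
- name: my-app
  resources:
    requests:
      cpu: "250m"
      memory: "256Mi"
- name: cloud-sql-proxy
  resources:
    requests:
      cpu: "250m"
      memory: "256Mi"

Since Autopilot bills and schedules based on the requested resources, starting small and raising the requests based on observed usage seems like the safer direction.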