Tags: google-kubernetes-engine, grpc-web, autopilot

How to configure GKE Autopilot w/Envoy & gRPC-Web


I have an application running on my local machine that uses React -> gRPC-Web -> Envoy -> Go app, and everything runs with no problems. I'm trying to deploy it using GKE Autopilot, and I just haven't been able to get the configuration right. I'm new to GCP/GKE, so I'm looking for help figuring out where I'm going wrong.

I was following this doc initially, even though I only have one gRPC service: https://cloud.google.com/architecture/exposing-grpc-services-on-gke-using-envoy-proxy

From what I've read, GKE Autopilot mode requires external HTTP(S) load balancing instead of the Network Load Balancing described in the solution above, so I've been trying to get that to work. After a variety of attempts, my current strategy uses an Ingress, a BackendConfig, a Service, and a Deployment. The Deployment has three containers: my app, an Envoy sidecar to translate the gRPC-Web requests and responses, and a Cloud SQL proxy sidecar. I eventually want to use TLS, but for now I've left it out so it doesn't complicate things even more.

When I apply all of the configs, the backend service shows one backend in one zone and the health check fails. The health check is set for port 8080 and path /healthz, which is what I think I've specified in the deployment config, but I'm suspicious because when I look at the details for the envoy-sidecar container, it shows the Readiness probe as: http-get HTTP://:0/healthz headers=x-envoy-livenessprobe:healthz. Does ":0" just mean it's using the default address and port for the container, or does it indicate a config problem?
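
One way I figure I can rule out a named-port resolution problem is to point the probe at the numeric port directly. A sketch of the equivalent probe (assuming Envoy really is listening on 8080, as configured below):

readinessProbe:
  httpGet:
    port: 8080  # numeric port instead of the named port "http"
    path: /healthz
    httpHeaders:
    - name: x-envoy-livenessprobe
      value: healthz
    scheme: HTTP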

I've been reading various docs and just haven't been able to piece it all together. Is there an example somewhere that shows how this can be done? I've been searching and haven't found one.

My current configs are:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grammar-games-ingress
  #annotations:
    # If the class annotation is not specified it defaults to "gce".
    # kubernetes.io/ingress.class: "gce"
    # kubernetes.io/ingress.global-static-ip-name: <IP addr>
spec:
  defaultBackend:
    service:
      name: grammar-games-core
      port:
        number: 80
---
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: grammar-games-bec
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
spec:
  sessionAffinity:
    affinityType: "CLIENT_IP"  
  healthCheck:
    checkIntervalSec: 15
    port: 8080
    type: HTTP
    requestPath: /healthz
  timeoutSec: 60
---
apiVersion: v1
kind: Service
metadata:
  name: grammar-games-core
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
    cloud.google.com/app-protocols: '{"http":"HTTP"}'
    cloud.google.com/backend-config: '{"default": "grammar-games-bec"}'
spec:
  type: ClusterIP
  selector:
    app: grammar-games-core
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grammar-games-core
spec:
  # Two replicas for right now, just so I can see how RPC calls get directed.
  # replicas: 2
  selector:
    matchLabels:
      app: grammar-games-core
  template:
    metadata:
      labels:
        app: grammar-games-core
    spec:
      serviceAccountName: grammar-games-core-k8sa
      containers:
      - name: grammar-games-core
        image: gcr.io/grammar-games/grammar-games-core:1.1.2
        command:
          - "/bin/grammar-games-core"
        ports:
        - containerPort: 52001
        env:
        - name: GAMESDB_USER
          valueFrom:
            secretKeyRef:
              name: gamesdb-config
              key: username
        - name: GAMESDB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: gamesdb-config
              key: password
        - name: GAMESDB_DB_NAME
          valueFrom:
            secretKeyRef:
              name: gamesdb-config
              key: db-name 
        - name: GRPC_SERVER_PORT
          value: '52001'
        - name: GAMES_LOG_FILE_PATH
          value: ''
        - name: GAMESDB_LOG_LEVEL
          value: 'debug'
        resources:
          requests:
            # Resource requests for the application container. Adjust these
            # values based on your application's requirements.
            memory: "2Gi"
            cpu:    "1"
        readinessProbe:
          exec:
            command: ["/bin/grpc_health_probe", "-addr=:52001"]
          initialDelaySeconds: 5
      - name: cloud-sql-proxy
        # It is recommended to use the latest version of the Cloud SQL proxy
        # Make sure to update on a regular schedule!
        image: gcr.io/cloudsql-docker/gce-proxy:1.24.0
        command:
          - "/cloud_sql_proxy"

          # If connecting from a VPC-native GKE cluster, you can use the
          # following flag to have the proxy connect over private IP
          # - "-ip_address_types=PRIVATE"

          # Replace DB_PORT with the port the proxy should listen on
          # Defaults: MySQL: 3306, Postgres: 5432, SQLServer: 1433
          - "-instances=grammar-games:us-east1:grammar-games-db=tcp:3306"
        securityContext:
          # The default Cloud SQL proxy image runs as the
          # "nonroot" user and group (uid: 65532) by default.
          runAsNonRoot: true
        # Resource configuration depends on an application's requirements. You
        # should adjust the following values based on what your application
        # needs. For details, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
        resources:
          requests:
            # The proxy's memory use scales linearly with the number of active
            # connections. Fewer open connections will use less memory. Adjust
            # this value based on your application's requirements.
            memory: "2Gi"
            # The proxy's CPU use scales linearly with the amount of IO between
            # the database and the application. Adjust this value based on your
            # application's requirements.
            cpu:    "1"
      - name: envoy-sidecar
        image: envoyproxy/envoy:v1.20-latest
        ports:
        - name: http
          containerPort: 8080
        resources:
          requests:
            cpu: 10m
            ephemeral-storage: 256Mi
            memory: 256Mi
        volumeMounts:
        - name: config
          mountPath: /etc/envoy
        readinessProbe:
          httpGet:
            port: http
            httpHeaders:
            - name: x-envoy-livenessprobe
              value: healthz
            path: /healthz
            scheme: HTTP
      volumes:
      - name: config
        configMap:
          name: envoy-sidecar-conf      
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-sidecar-conf
data:
  envoy.yaml: |
    static_resources:
      listeners:
      - name: listener_0
        address:
          socket_address:
            address: 0.0.0.0
            port_value: 8080
        filter_chains:
        - filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              access_log:
              - name: envoy.access_loggers.stdout
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
              codec_type: AUTO
              stat_prefix: ingress_http
              route_config:
                name: local_route
                virtual_hosts:
                - name: http
                  domains:
                  - "*"
                  routes:
                  - match:
                      prefix: "/grammar_games_protos.GrammarGames/"
                    route:
                      cluster: grammar-games-core-grpc
                  cors:
                    allow_origin_string_match:
                    - prefix: "*"
                    allow_methods: GET, PUT, DELETE, POST, OPTIONS
                    allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout
                    max_age: "1728000"
                    expose_headers: custom-header-1,grpc-status,grpc-message
              http_filters:
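              # The health_check filter answers /healthz requests itself when
              # pass_through_mode is false, so the probe succeeds without ever
              # reaching the gRPC backend.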
              - name: envoy.filters.http.health_check
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
                  pass_through_mode: false
                  headers:
                  - name: ":path"
                    exact_match: "/healthz"
                  - name: "x-envoy-livenessprobe"
                    exact_match: "healthz"
              - name: envoy.filters.http.grpc_web
              - name: envoy.filters.http.cors
              - name: envoy.filters.http.router
                typed_config: {}
      clusters:
      - name: grammar-games-core-grpc
        connect_timeout: 0.5s
        type: logical_dns
        lb_policy: ROUND_ROBIN
        http2_protocol_options: {}
        load_assignment:
          cluster_name: grammar-games-core-grpc
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: 0.0.0.0
                    port_value: 52001
        health_checks:
        - timeout: 1s
          interval: 10s
          unhealthy_threshold: 2
          healthy_threshold: 2
          grpc_health_check: {}
    admin:
      access_log_path: /dev/stdout
      address:
        socket_address:
          address: 127.0.0.1
          port_value: 8090


Solution

  • I've finally gotten through this issue, so I wanted to post the answer I have for reference.

    Turns out that the solution in this document works:

    https://cloud.google.com/architecture/exposing-grpc-services-on-gke-using-envoy-proxy#introduction

    Somewhere in one of the docs about GKE Autopilot mode, I got the impression that you can't use a Network Load Balancer and instead need to use the Ingress for HTTP(S) Load Balancing. That's why I was pursuing the other approach, but even after working with Google support for several weeks, and with configs that all looked correct, the health check from the load balancer just would not work. That's when we figured out that the solution with the Network Load Balancer actually does work.

    I also had some issues getting HTTPS/TLS configured. That turned out to be a problem in my Envoy config file.
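
    The Envoy deployment below mounts a Secret named envoy-certs for the TLS certificate and key, but that Secret isn't shown in the configs. Here's a minimal sketch of what it looks like (a standard kubernetes.io/tls Secret; the data values are placeholders for your own base64-encoded cert chain and private key):

    apiVersion: v1
    kind: Secret
    metadata:
      name: envoy-certs
    type: kubernetes.io/tls
    data:
      tls.crt: <base64-encoded certificate chain>
      tls.key: <base64-encoded private key>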

    I still have one remaining issue with the stability of the pods, but that's a separate problem that I'll pursue in a different post/question. As long as I only ask for 1 replica, the solution is stable and working well, and Autopilot is supposed to scale the pods up as necessary.

    I know the config for all of this can be very tricky, so I'm including it all here for reference (just using my-app as a placeholder). Hopefully it will help someone else get there faster than I did! I think it's a great solution for gRPC-Web once you can get it working. You'll also notice that I'm using the cloud-sql-proxy sidecar to talk to the Cloud SQL database, with a GKE service account for authentication.
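
    One piece that isn't in the manifests below is the Kubernetes service account (my-app-k8sa) that the Deployment references. A minimal sketch, assuming Workload Identity is enabled on the cluster (it is by default on Autopilot) and that a Google service account with the Cloud SQL Client role already exists (the my-app-gsa name here is a placeholder):

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: my-app-k8sa
      annotations:
        # Binds the Kubernetes service account to the (placeholder) Google
        # service account via Workload Identity.
        iam.gke.io/gcp-service-account: my-app-gsa@my-project.iam.gserviceaccount.com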

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          serviceAccountName: my-app-k8sa
          terminationGracePeriodSeconds: 30
          containers:
          - name: my-app
            image: gcr.io/my-project/my-app:1.1.0
            command:
              - "/bin/my-app"
            ports:
            - containerPort: 52001
            env:
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: db-config
                  key: username
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-config
                  key: password
            - name: DB_NAME
              valueFrom:
                secretKeyRef:
                  name: db-config
                  key: db-name 
            - name: GRPC_SERVER_PORT
              value: '52001'
            readinessProbe:
              exec:
                command: ["/bin/grpc_health_probe", "-addr=:52001"]
              initialDelaySeconds: 10
            livenessProbe:
              exec:
                command: ["/bin/grpc_health_probe", "-addr=:52001"]
              initialDelaySeconds: 15
          - name: cloud-sql-proxy
            # It is recommended to use the latest version of the Cloud SQL proxy
            # Make sure to update on a regular schedule!
            image: gcr.io/cloudsql-docker/gce-proxy:1.27.1
            command:
              - "/cloud_sql_proxy"
    
              # If connecting from a VPC-native GKE cluster, you can use the
              # following flag to have the proxy connect over private IP
              # - "-ip_address_types=PRIVATE"
    
              # Replace DB_PORT with the port the proxy should listen on
              # Defaults: MySQL: 3306, Postgres: 5432, SQLServer: 1433
              - "-instances=my-project:us-east1:my-app-db=tcp:3306"
            securityContext:
              # The default Cloud SQL proxy image runs as the
              # "nonroot" user and group (uid: 65532) by default.
              runAsNonRoot: true
    
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: my-app
    spec:
      type: ClusterIP
      selector:
        app: my-app
      ports:
      - name: my-app-port
        protocol: TCP
        port: 52001
      clusterIP: None
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: envoy
    spec:
      type: LoadBalancer
      selector:
        app: envoy
      ports:
      - name: https
        protocol: TCP
        port: 443
        targetPort: 8443
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: envoy
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: envoy
      template:
        metadata:
          labels:
            app: envoy
        spec:
          containers:
          - name: envoy
            image: envoyproxy/envoy:v1.20-latest
            ports:
            - name: https
              containerPort: 8443
            resources:
              requests:
                cpu: 10m
                ephemeral-storage: 256Mi
                memory: 256Mi
            volumeMounts:
            - name: config
              mountPath: /etc/envoy
            - name: certs
              mountPath: /etc/ssl/envoy
            readinessProbe:
              httpGet:
                port: https
                httpHeaders:
                - name: x-envoy-livenessprobe
                  value: healthz
                path: /healthz
                scheme: HTTPS
          volumes:
          - name: config
            configMap:
              name: envoy-conf
          - name: certs
            secret:
              secretName: envoy-certs
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: envoy-conf
    data:
      envoy.yaml: |
        static_resources:
          listeners:
          - name: listener_0
            address:
              socket_address:
                address: 0.0.0.0
                port_value: 8443
            filter_chains:
            - filters:
              - name: envoy.filters.network.http_connection_manager
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                  access_log:
                  - name: envoy.access_loggers.stdout
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
                  codec_type: AUTO
                  stat_prefix: ingress_https
                  route_config:
                    name: local_route
                    virtual_hosts:
                    - name: https
                      domains:
                      - "*"
                      routes:
                      - match:
                          prefix: "/my_app_protos.MyService/"
                        route:
                          cluster: my-app-cluster
                      cors:
                        allow_origin_string_match:
                        - prefix: "*"
                        allow_methods: GET, PUT, DELETE, POST, OPTIONS
                        allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout
                        max_age: "1728000"
                        expose_headers: custom-header-1,grpc-status,grpc-message
                  http_filters:
                  - name: envoy.filters.http.health_check
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
                      pass_through_mode: false
                      headers:
                      - name: ":path"
                        exact_match: "/healthz"
                      - name: "x-envoy-livenessprobe"
                        exact_match: "healthz"
                  - name: envoy.filters.http.grpc_web
                  - name: envoy.filters.http.cors
                  - name: envoy.filters.http.router
                    typed_config: {}
              transport_socket:
                name: envoy.transport_sockets.tls
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
                  require_client_certificate: false
                  common_tls_context:
                    tls_certificates:
                    - certificate_chain:
                        filename: /etc/ssl/envoy/tls.crt
                      private_key:
                        filename: /etc/ssl/envoy/tls.key
          clusters:
          - name: my-app-cluster
            connect_timeout: 0.5s
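            # STRICT_DNS plus the headless my-app Service (clusterIP: None)
            # resolves to one address per pod, letting Envoy round-robin
            # across all replicas instead of hitting a single ClusterIP.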
            type: STRICT_DNS
            dns_lookup_family: V4_ONLY
            lb_policy: ROUND_ROBIN
            http2_protocol_options: {}
            load_assignment:
              cluster_name: my-app-cluster
              endpoints:
              - lb_endpoints:
                - endpoint:
                    address:
                      socket_address:
                        address: my-app.default.svc.cluster.local
                        port_value: 52001
            health_checks:
            - timeout: 1s
              interval: 10s
              unhealthy_threshold: 2
              healthy_threshold: 2
              grpc_health_check: {}
        admin:
          access_log_path: /dev/stdout
          address:
            socket_address:
              address: 127.0.0.1
              port_value: 8090
    

    I'm still not sure about specifying the resource requirements for both containers in the Deployment and the number of replicas, but the solution is working.
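
    If pod-level scaling turns out to be needed, my understanding is that Autopilot provisions nodes for whatever pods you request, but scaling the number of pods is still up to you, e.g. via a HorizontalPodAutoscaler. A sketch of what that might look like for this Deployment (the name and thresholds are placeholders):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-app-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app
      minReplicas: 1
      maxReplicas: 5
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70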