I have an application running on my local machine that uses React -> gRPC-Web -> Envoy -> a Go app, and everything runs with no problems. I'm trying to deploy this using GKE Autopilot, but I just haven't been able to get the configuration right. I'm new to all of GCP/GKE, so I'm looking for help figuring out where I'm going wrong.
I was following this doc initially, even though I only have one gRPC service: https://cloud.google.com/architecture/exposing-grpc-services-on-gke-using-envoy-proxy
From what I've read, GKE Autopilot mode requires using external HTTP(S) Load Balancing instead of the Network Load Balancing described in the solution above, so that's what I've been trying to get working. After a variety of attempts, my current strategy has an Ingress, a BackendConfig, a Service, and a Deployment. The Deployment has three containers: my app, an Envoy sidecar to translate the gRPC-Web requests and responses, and a Cloud SQL Proxy sidecar. I eventually want to use TLS, but for now I've left it out so it doesn't complicate things even more.
When I apply all of the configs, the backend service shows one backend in one zone, and the health check fails. The health check is set to port 8080 and path /healthz, which is what I think I've specified in the BackendConfig and the Deployment, but I'm suspicious because, when I look at the details for the envoy-sidecar container, it shows the readiness probe as: http-get HTTP://:0/healthz headers=x-envoy-livenessprobe:healthz. Does ":0" just mean it's using the default address and port for the container, or does it indicate a config problem?
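For comparison, here is the same readiness probe written against the numeric container port instead of the named port, in case the named-port form is what's producing the ":0" in that output. This is only a variation to help rule things out, not something I've confirmed changes the behavior:

# envoy-sidecar readiness probe using an explicit port number
# instead of the named port "http"
readinessProbe:
  httpGet:
    port: 8080
    path: /healthz
    httpHeaders:
    - name: x-envoy-livenessprobe
      value: healthz
    scheme: HTTP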
I've been reading various docs and just haven't been able to piece it all together. Is there an example somewhere that shows how this can be done? I've been searching and haven't found one.
My current configs are:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: grammar-games-ingress
#annotations:
# If the class annotation is not specified it defaults to "gce".
# kubernetes.io/ingress.class: "gce"
# kubernetes.io/ingress.global-static-ip-name: <IP addr>
spec:
defaultBackend:
service:
name: grammar-games-core
port:
number: 80
---
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
name: grammar-games-bec
annotations:
cloud.google.com/neg: '{"ingress": true}'
spec:
sessionAffinity:
affinityType: "CLIENT_IP"
healthCheck:
checkIntervalSec: 15
port: 8080
type: HTTP
requestPath: /healthz
timeoutSec: 60
---
apiVersion: v1
kind: Service
metadata:
name: grammar-games-core
annotations:
cloud.google.com/neg: '{"ingress": true}'
cloud.google.com/app-protocols: '{"http":"HTTP"}'
cloud.google.com/backend-config: '{"default": "grammar-games-bec"}'
spec:
type: ClusterIP
selector:
app: grammar-games-core
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: grammar-games-core
spec:
# Two replicas for right now, just so I can see how RPC calls get directed.
# replicas: 2
selector:
matchLabels:
app: grammar-games-core
template:
metadata:
labels:
app: grammar-games-core
spec:
serviceAccountName: grammar-games-core-k8sa
containers:
- name: grammar-games-core
image: gcr.io/grammar-games/grammar-games-core:1.1.2
command:
- "/bin/grammar-games-core"
ports:
- containerPort: 52001
env:
- name: GAMESDB_USER
valueFrom:
secretKeyRef:
name: gamesdb-config
key: username
- name: GAMESDB_PASSWORD
valueFrom:
secretKeyRef:
name: gamesdb-config
key: password
- name: GAMESDB_DB_NAME
valueFrom:
secretKeyRef:
name: gamesdb-config
key: db-name
- name: GRPC_SERVER_PORT
value: '52001'
- name: GAMES_LOG_FILE_PATH
value: ''
- name: GAMESDB_LOG_LEVEL
value: 'debug'
resources:
requests:
            # Memory request for the app container. Adjust this value based
            # on your application's requirements.
memory: "2Gi"
            # CPU request for the app container. Adjust this value based
            # on your application's requirements.
cpu: "1"
readinessProbe:
exec:
command: ["/bin/grpc_health_probe", "-addr=:52001"]
initialDelaySeconds: 5
- name: cloud-sql-proxy
# It is recommended to use the latest version of the Cloud SQL proxy
# Make sure to update on a regular schedule!
image: gcr.io/cloudsql-docker/gce-proxy:1.24.0
command:
- "/cloud_sql_proxy"
# If connecting from a VPC-native GKE cluster, you can use the
# following flag to have the proxy connect over private IP
# - "-ip_address_types=PRIVATE"
# Replace DB_PORT with the port the proxy should listen on
# Defaults: MySQL: 3306, Postgres: 5432, SQLServer: 1433
- "-instances=grammar-games:us-east1:grammar-games-db=tcp:3306"
securityContext:
# The default Cloud SQL proxy image runs as the
# "nonroot" user and group (uid: 65532) by default.
runAsNonRoot: true
# Resource configuration depends on an application's requirements. You
# should adjust the following values based on what your application
# needs. For details, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
resources:
requests:
# The proxy's memory use scales linearly with the number of active
# connections. Fewer open connections will use less memory. Adjust
# this value based on your application's requirements.
memory: "2Gi"
# The proxy's CPU use scales linearly with the amount of IO between
# the database and the application. Adjust this value based on your
# application's requirements.
cpu: "1"
- name: envoy-sidecar
image: envoyproxy/envoy:v1.20-latest
ports:
- name: http
containerPort: 8080
resources:
requests:
cpu: 10m
ephemeral-storage: 256Mi
memory: 256Mi
volumeMounts:
- name: config
mountPath: /etc/envoy
readinessProbe:
httpGet:
port: http
httpHeaders:
- name: x-envoy-livenessprobe
value: healthz
path: /healthz
scheme: HTTP
volumes:
- name: config
configMap:
name: envoy-sidecar-conf
---
apiVersion: v1
kind: ConfigMap
metadata:
name: envoy-sidecar-conf
data:
envoy.yaml: |
static_resources:
listeners:
- name: listener_0
address:
socket_address:
address: 0.0.0.0
port_value: 8080
filter_chains:
- filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
access_log:
- name: envoy.access_loggers.stdout
typed_config:
"@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
codec_type: AUTO
stat_prefix: ingress_http
route_config:
name: local_route
virtual_hosts:
- name: http
domains:
- "*"
routes:
- match:
prefix: "/grammar_games_protos.GrammarGames/"
route:
cluster: grammar-games-core-grpc
cors:
allow_origin_string_match:
- prefix: "*"
allow_methods: GET, PUT, DELETE, POST, OPTIONS
allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout
max_age: "1728000"
expose_headers: custom-header-1,grpc-status,grpc-message
http_filters:
- name: envoy.filters.http.health_check
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
pass_through_mode: false
headers:
- name: ":path"
exact_match: "/healthz"
- name: "x-envoy-livenessprobe"
exact_match: "healthz"
- name: envoy.filters.http.grpc_web
- name: envoy.filters.http.cors
- name: envoy.filters.http.router
typed_config: {}
clusters:
- name: grammar-games-core-grpc
connect_timeout: 0.5s
type: logical_dns
lb_policy: ROUND_ROBIN
http2_protocol_options: {}
load_assignment:
cluster_name: grammar-games-core-grpc
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: 0.0.0.0
port_value: 52001
health_checks:
timeout: 1s
interval: 10s
unhealthy_threshold: 2
healthy_threshold: 2
grpc_health_check: {}
admin:
access_log_path: /dev/stdout
address:
socket_address:
address: 127.0.0.1
port_value: 8090
I've finally gotten through this issue, so I wanted to post the answer I have for reference.
It turns out that the solution in this document works:
https://cloud.google.com/architecture/exposing-grpc-services-on-gke-using-envoy-proxy#introduction
Somewhere in one of the docs about GKE Autopilot mode, I got the impression that you can't use a Network Load Balancer and instead need to use Ingress for HTTP(S) Load Balancing. That's why I was pursuing the other approach. Even after working with Google support for several weeks, the configs all looked correct, but the health check from the load balancer just would not pass. That's when we figured out that the solution with the Network Load Balancer actually does work.
I also had some issues getting HTTPS/TLS configured; that turned out to be a problem in my Envoy config file.
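For reference, the envoy-certs secret that the Envoy Deployment below mounts at /etc/ssl/envoy is just a standard Kubernetes TLS secret, so its tls.crt and tls.key entries line up with the filenames in the Envoy config. A minimal sketch, assuming you already have a PEM certificate chain and private key (the actual values are placeholders here):

apiVersion: v1
kind: Secret
metadata:
  name: envoy-certs
type: kubernetes.io/tls
data:
  # base64-encoded PEM data; placeholders shown
  tls.crt: <base64-encoded certificate chain>
  tls.key: <base64-encoded private key>

The same secret can also be created directly with kubectl create secret tls envoy-certs --cert=server.crt --key=server.key (the file names are just examples).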
I still have one remaining issue with the stability of the pods, but that's a separate issue that I'll pursue in a different post/question. As long as I only ask for 1 replica, the solution is stable and working well; Autopilot takes care of provisioning the underlying nodes, and scaling the number of pod replicas beyond that is something I'd still need to handle myself (for example with a HorizontalPodAutoscaler).
I know the config for all of this can be very tricky, so I'm including it all here for reference (just using my-app as a placeholder). Hopefully it will help someone else get there faster than I did! I think it's a great solution for gRPC-Web once you get it working. You'll also notice that I'm using the cloud-sql-proxy sidecar to talk to Cloud SQL and a GKE service account for authentication.
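One piece not shown below is the service account itself. In a typical Workload Identity setup (which is what I'm sketching here; the Google service account and project names are placeholders), my-app-k8sa would look roughly like this:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-k8sa
  annotations:
    # Bind this Kubernetes service account to a Google service account
    # that has the Cloud SQL Client role (placeholder names).
    iam.gke.io/gcp-service-account: my-app-gsa@my-project.iam.gserviceaccount.com

The Google service account also needs a roles/iam.workloadIdentityUser binding for this Kubernetes service account, and roles/cloudsql.client so the proxy can connect.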
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 1
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
serviceAccountName: my-app-k8sa
terminationGracePeriodSeconds: 30
containers:
- name: my-app
image: gcr.io/my-project/my-app:1.1.0
command:
- "/bin/my-app"
ports:
- containerPort: 52001
env:
- name: DB_USER
valueFrom:
secretKeyRef:
name: db-config
key: username
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: db-config
key: password
- name: DB_NAME
valueFrom:
secretKeyRef:
name: db-config
key: db-name
- name: GRPC_SERVER_PORT
value: '52001'
readinessProbe:
exec:
command: ["/bin/grpc_health_probe", "-addr=:52001"]
initialDelaySeconds: 10
livenessProbe:
exec:
command: ["/bin/grpc_health_probe", "-addr=:52001"]
initialDelaySeconds: 15
- name: cloud-sql-proxy
# It is recommended to use the latest version of the Cloud SQL proxy
# Make sure to update on a regular schedule!
image: gcr.io/cloudsql-docker/gce-proxy:1.27.1
command:
- "/cloud_sql_proxy"
# If connecting from a VPC-native GKE cluster, you can use the
# following flag to have the proxy connect over private IP
# - "-ip_address_types=PRIVATE"
# Replace DB_PORT with the port the proxy should listen on
# Defaults: MySQL: 3306, Postgres: 5432, SQLServer: 1433
- "-instances=my-project:us-east1:my-app-db=tcp:3306"
securityContext:
# The default Cloud SQL proxy image runs as the
# "nonroot" user and group (uid: 65532) by default.
runAsNonRoot: true
---
apiVersion: v1
kind: Service
metadata:
name: my-app
spec:
type: ClusterIP
selector:
app: my-app
ports:
- name: my-app-port
protocol: TCP
port: 52001
clusterIP: None
---
apiVersion: v1
kind: Service
metadata:
name: envoy
spec:
type: LoadBalancer
selector:
app: envoy
ports:
- name: https
protocol: TCP
port: 443
targetPort: 8443
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: envoy
spec:
replicas: 1
selector:
matchLabels:
app: envoy
template:
metadata:
labels:
app: envoy
spec:
containers:
- name: envoy
image: envoyproxy/envoy:v1.20-latest
ports:
- name: https
containerPort: 8443
resources:
requests:
cpu: 10m
ephemeral-storage: 256Mi
memory: 256Mi
volumeMounts:
- name: config
mountPath: /etc/envoy
- name: certs
mountPath: /etc/ssl/envoy
readinessProbe:
httpGet:
port: https
httpHeaders:
- name: x-envoy-livenessprobe
value: healthz
path: /healthz
scheme: HTTPS
volumes:
- name: config
configMap:
name: envoy-conf
- name: certs
secret:
secretName: envoy-certs
---
apiVersion: v1
kind: ConfigMap
metadata:
name: envoy-conf
data:
envoy.yaml: |
static_resources:
listeners:
- name: listener_0
address:
socket_address:
address: 0.0.0.0
port_value: 8443
filter_chains:
- filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
access_log:
- name: envoy.access_loggers.stdout
typed_config:
"@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
codec_type: AUTO
stat_prefix: ingress_https
route_config:
name: local_route
virtual_hosts:
- name: https
domains:
- "*"
routes:
- match:
prefix: "/my_app_protos.MyService/"
route:
cluster: my-app-cluster
cors:
allow_origin_string_match:
- prefix: "*"
allow_methods: GET, PUT, DELETE, POST, OPTIONS
allow_headers: keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout
max_age: "1728000"
expose_headers: custom-header-1,grpc-status,grpc-message
http_filters:
- name: envoy.filters.http.health_check
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
pass_through_mode: false
headers:
- name: ":path"
exact_match: "/healthz"
- name: "x-envoy-livenessprobe"
exact_match: "healthz"
- name: envoy.filters.http.grpc_web
- name: envoy.filters.http.cors
- name: envoy.filters.http.router
typed_config: {}
transport_socket:
name: envoy.transport_sockets.tls
typed_config:
"@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
require_client_certificate: false
common_tls_context:
tls_certificates:
- certificate_chain:
filename: /etc/ssl/envoy/tls.crt
private_key:
filename: /etc/ssl/envoy/tls.key
clusters:
- name: my-app-cluster
connect_timeout: 0.5s
type: STRICT_DNS
dns_lookup_family: V4_ONLY
lb_policy: ROUND_ROBIN
http2_protocol_options: {}
load_assignment:
cluster_name: my-app-cluster
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: my-app.default.svc.cluster.local
port_value: 52001
health_checks:
timeout: 1s
interval: 10s
unhealthy_threshold: 2
healthy_threshold: 2
grpc_health_check: {}
admin:
access_log_path: /dev/stdout
address:
socket_address:
address: 127.0.0.1
port_value: 8090
I'm still not sure about the best way to specify resource requirements for the containers in these Deployments or the number of replicas, but the solution is working.
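For what it's worth, one illustrative starting point (these numbers are not values I've tuned, and Autopilot may adjust requests upward to meet its own minimums) is to give each container in the my-app Deployment an explicit request:

# Only the resources stanzas are shown; the rest of each container spec
# stays the same as above.
containers:
- name: my-app
  resources:
    requests:
      cpu: "250m"
      memory: "256Mi"
- name: cloud-sql-proxy
  resources:
    requests:
      cpu: "250m"
      memory: "256Mi"

Since Autopilot bills and schedules based on the requested resources, starting small and raising the requests based on observed usage seems like the safer direction.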