Tags: kubernetes, google-kubernetes-engine, grpc, kubernetes-ingress, kubernetes-health-check

gRPC & HTTP servers on GKE Ingress failing healthcheck for gRPC backend


I want to deploy gRPC + HTTP servers on GKE with HTTP/2 and mutual TLS. My deployment has both a readiness probe and a liveness probe with a custom path. I expose both the gRPC and HTTP servers via an Ingress.

The deployment's probes and exposed ports:

    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /_ah/health
        port: 8443
        scheme: HTTPS
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /_ah/health
        port: 8443
        scheme: HTTPS
    name: grpc-gke
    ports:
    - containerPort: 8443
      protocol: TCP
    - containerPort: 50052
      protocol: TCP

NodePort service:

apiVersion: v1
kind: Service
metadata:
  name: grpc-gke-nodeport
  labels:
    app: grpc-gke
  annotations:
    cloud.google.com/app-protocols: '{"grpc":"HTTP2","http":"HTTP2"}'
    service.alpha.kubernetes.io/app-protocols: '{"grpc":"HTTP2", "http": "HTTP2"}'
spec:
  type: NodePort
  ports:
  - name: grpc
    port: 50052
    protocol: TCP
    targetPort: 50052
  - name: http
    port: 443
    protocol: TCP
    targetPort: 8443
  selector:
    app: grpc-gke

Ingress:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grpc-gke-ingress
  annotations:
    kubernetes.io/ingress.allow-http: "false"
    #kubernetes.io/ingress.global-static-ip-name: "grpc-gke-ip"
  labels:
    app: grpc-gke
spec:
  rules:
  - http:
      paths:
      - path: /_ah/*
        backend:
          serviceName: grpc-gke-nodeport
          servicePort: 443
  backend:
    serviceName: grpc-gke-nodeport
    servicePort: 50052

The pod exists and has a "green" status, even before I added the liveness and readiness probes. With the probes in place, I see regular logs on my server showing that both /_ah/live and /_ah/ready are called by kube-probe, and the server responds with 200.

I use a Google-managed TLS certificate on the load balancer (LB). My HTTP server creates a self-signed certificate, inspired by this blog.
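
For context, a self-signed certificate equivalent to what the server generates in code could be produced like this (the file names and CN are assumptions, not what my server actually uses):

    # illustrative only: self-signed cert/key pair for the HTTP server
    openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
      -subj "/CN=grpc-gke" \
      -keyout tls.key -out tls.crt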

I create the Ingress after I start seeing the probes' logs. The Ingress then creates an LB with two backends, one for HTTP and one for gRPC. The HTTP backend's health checks pass and the HTTP server is accessible from the Internet. The gRPC backend's health check fails, so the LB does not route gRPC traffic and I receive a 502 error response.
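
To see what the Ingress controller actually created, the backends and their health checks can be listed with gcloud (the health-check name below is an assumption; GKE generates names along the lines of k8s-be-<nodePort>--<hash>):

    # list the backend services and health checks created for the LB
    gcloud compute backend-services list
    gcloud compute health-checks list
    # inspect the gRPC backend's health check (name is illustrative)
    gcloud compute health-checks describe k8s-be-32148--0123456789abcdef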

This is with GKE master 1.12.7-gke.10. I also tried a newer 1.13 and an older 1.11 master. The cluster has HTTP load balancing and VPC-native networking enabled. There are firewall rules to allow access from the LB to my pods (I even tried allowing all ports from all IP addresses). Delaying the probes does not help either.
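
For reference, the firewall rule I mean is roughly the following sketch (the rule name and network are assumptions; 130.211.0.0/22 and 35.191.0.0/16 are Google's documented health-check source ranges):

    # allow GCP health checkers to reach the cluster's NodePort range
    gcloud compute firewall-rules create allow-lb-health-checks \
      --network=default \
      --source-ranges=130.211.0.0/22,35.191.0.0/16 \
      --allow=tcp:30000-32767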

The funny thing is that I deployed a nearly identical setup (only the server's Docker image is different) a couple of months ago, and it runs without any issues. I can even deploy new Docker images of the server and everything is fine. I cannot find any difference between the two.

There is one other issue: the Ingress is stuck in the "Creating Ingress" state for days. It never finishes and never sees the LB. The Ingress' LB never gets a front-end, and I always have to manually add an HTTP/2 front-end with a static IP and a Google-managed TLS certificate. This should only happen for clusters created without "HTTP load balancing", but in my case it happens every time, on all of my clusters with "HTTP load balancing" enabled. The working deployment has been in this state for months already.
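
While it is stuck, the Ingress' status and events can be checked with:

    # show the Ingress' status and any events from the Ingress controller
    kubectl describe ingress grpc-gke-ingress
    # recent cluster events, oldest first
    kubectl get events --sort-by=.metadata.creationTimestamp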

Any ideas why the gRPC backend's health check could be failing, even though I see in the logs that the readiness and liveness endpoints are called by kube-probe?

EDIT:

kubectl describe svc grpc-gke-nodeport

Name:                     grpc-gke-nodeport
Namespace:                default
Labels:                   app=grpc-gke
Annotations:              cloud.google.com/app-protocols: {"grpc":"HTTP2","http":"HTTP2"}
                          kubectl.kubernetes.io/last-applied-configuration:
                            {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"cloud.google.com/app-protocols":"{\"grpc\":\"HTTP2\",\"http\":\"HTTP2\"}",...
                          service.alpha.kubernetes.io/app-protocols: {"grpc":"HTTP2", "http": "HTTP2"}
Selector:                 app=grpc-gke
Type:                     NodePort
IP:                       10.4.8.188
Port:                     grpc  50052/TCP
TargetPort:               50052/TCP
NodePort:                 grpc  32148/TCP
Endpoints:                10.0.0.25:50052
Port:                     http  443/TCP
TargetPort:               8443/TCP
NodePort:                 http  30863/TCP
Endpoints:                10.0.0.25:8443
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

The health check for the gRPC backend is an HTTP/2 GET using path / on port 32148. Its description is "Default kubernetes L7 Loadbalancing health check.", whereas the description of the HTTP backend's health check is "Kubernetes L7 health check generated with readiness probe settings.". Thus the health check for the gRPC backend is not created from the readiness probe.
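
The mismatch can be reproduced from inside the VPC with curl (NODE_IP is a placeholder for a cluster node's IP; -k because the serving certificates are self-signed):

    # what the default health check does: an HTTP/2 GET / against the gRPC
    # NodePort -- the mutual-TLS gRPC server cannot answer this with a 200
    curl -vk --http2 https://NODE_IP:32148/
    # the readiness path on the HTTP NodePort, which does return 200
    curl -vk --http2 https://NODE_IP:30863/_ah/health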

Editing the health check to point to port 30863 and changing the path to the readiness probe's path fixes the issue.
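
The same edit can be made with gcloud instead of the console; a sketch, assuming the generated health-check name found with the list command above:

    # repoint the gRPC backend's health check at the HTTP NodePort and the
    # readiness probe's path (health-check name is illustrative)
    gcloud compute health-checks update http2 k8s-be-32148--0123456789abcdef \
      --port 30863 \
      --request-path "/_ah/health"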


Solution

  • Editing the health check to point to the readiness probe's path and changing the port to the HTTP backend's fixed this issue (look for the port in the HTTP backend's health check; it is the NodePort). It now runs without any issues.

    Using the same health check for the gRPC backend as for the HTTP backend did not work; it was reset back to its own health check. Even deleting the gRPC backend's health check did not help; it was recreated. Only editing it to use a different port and path helped.