google-kubernetes-engine, gke-networking, google-anthos, google-anthos-service-mesh

Anthos Multi Cluster Ingress - intermittent connectivity and disappearing backend service


I'm running a two-cluster private GKE setup in europe-west2: a dedicated config cluster for MCI and a worker cluster for workloads. Both clusters are registered to the Anthos hub and the multi-cluster ingress feature is enabled on the config cluster. In addition, the worker cluster runs the latest ASM, 1.12.2.

As far as MCI is concerned, my deployment is 'standard', i.e. based on the available docs (e.g. https://cloud.google.com/architecture/distributed-services-on-gke-private-using-anthos-service-mesh#configure-multi-cluster-ingress, the terraform-example-foundation repo, etc.).
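
For reference, the MCI side of the deployment looks roughly like the sketch below. Names, namespace, selector, cert and port details are simplified placeholders; the real manifests follow the linked guide.

apiVersion: networking.gke.io/v1
kind: MultiClusterIngress
metadata:
  name: asm-ingress-mci          # placeholder name
  namespace: istio               # placeholder namespace
  annotations:
    # static IP referenced via the compute address resource URL, per the docs
    networking.gke.io/static-ip: "https://www.googleapis.com/compute/v1/projects/prj-foo-bar/global/addresses/ADDRESS_NAME"
    # pre-shared certificate for HTTPS termination on the external LB
    networking.gke.io/pre-shared-certs: "CERT_NAME"
spec:
  template:
    spec:
      backend:
        serviceName: asm-ingress-mcs
        servicePort: 80
---
apiVersion: networking.gke.io/v1
kind: MultiClusterService
metadata:
  name: asm-ingress-mcs          # placeholder name
  namespace: istio
spec:
  template:
    spec:
      selector:
        app: asm-ingressgateway  # placeholder label on the ASM ingress gateway pods
      ports:
        - name: http
          protocol: TCP
          port: 80
          targetPort: 8080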

Everything works, but I'm hitting an intermittent connectivity issue no matter how many times I redeploy the entire stack. My eyes are bleeding from staring at the logging dashboard. I've run out of dots to connect.

I'm probing some endpoints exposed by my cluster, which most of the time return 200, with the following logged under resource.type="http_load_balancer":

{
httpRequest: {
 latency: "0.081658s"
 remoteIp: "20.83.144.189"
 requestMethod: "GET"
 requestSize: "360"
 requestUrl: "https://foo.bar.io/"
 responseSize: "1054"
 serverIp: "100.64.72.136"
 status: 200
 ...
}
insertId: "10mjvz4e8g0nq"
jsonPayload: {
 @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
 statusDetails: "response_sent_by_backend"
}
...
resource: {
 labels: {
  backend_service_name: "mci-4z8mmz-80-asm-ingress-mcs-istio"
  forwarding_rule_name: "mci-4z8mmz-fws-asm-ingress-mci-istio"
  project_id: "prj-foo-bar"
  target_proxy_name: "mci-4z8mmz-asm-ingress-mci-istio"
  url_map_name: "mci-4z8mmz-asm-ingress-mci-istio"
  zone: "global"
 }
 type: "http_load_balancer"
}
severity: "INFO"
spanId: "2a986abfc69bef6f"
timestamp: "2022-02-04T15:24:14.160642Z"
...
}

At random intervals, anywhere between 1 and 5 hours, the probes start failing with 404 for a period of 5-10 minutes and the following is logged:

{
httpRequest: {
 ...
 requestMethod: "GET"
 ...
 requestUrl: "https://foo.bar.io/"
 ...
 status: 404
 ...
}
insertId: "10mjvz4e8g0nq"
jsonPayload: {
 @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
 statusDetails: "internal_error"
}
...
resource: {
 labels: {
  backend_service_name: ""
  forwarding_rule_name: "mci-4z8mmz-fws-asm-ingress-mci-istio"
  project_id: "prj-foo-bar"
  target_proxy_name: "mci-4z8mmz-asm-ingress-mci-istio"
  url_map_name: "mci-4z8mmz-asm-ingress-mci-istio"
  zone: "global"
 }
 type: "http_load_balancer"
}
severity: "WARNING"
...
}

backend_service_name and serverIp disappear, and the external LB provisioned via MCI goes for an extended nap. If I try to access the endpoints in a browser during that period I get 404'd and eventually the connection is closed.

I've searched the logs far and wide and cannot find any leads.

Has anyone experienced a similar issue? Could this be a regional thing? I have yet to try deploying to another region.

Any info/links/ideas much appreciated.

Edit:

I also confirmed that the health checks are fine and there are no transitions. The pods never receive the requests, so the 404s are coming from the external LB.


Solution

  • I had the same/similar issue when using HTTPS with MultiClusterIngress.

    Google support suggested using a literal static IP for the annotation:

    networking.gke.io/static-ip: STATIC_IP_ADDRESS
    

    Try using a literal IP like

    34.102.201.47
    

    instead of

    https://www.googleapis.com/compute/v1/projects/PROJECT_ID/global/addresses/ADDRESS_NAME
    

    as described in https://cloud.google.com/kubernetes-engine/docs/how-to/multi-cluster-ingress#static (a rough sketch of what that looks like is at the end of this answer).

    If that doesn't solve the issue, try contacting Google Support.
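
    For illustration, this is roughly what the suggested change looks like on the MultiClusterIngress; the name, namespace, backend and IP below are placeholders:

    apiVersion: networking.gke.io/v1
    kind: MultiClusterIngress
    metadata:
      name: asm-ingress-mci          # placeholder
      namespace: istio               # placeholder
      annotations:
        # literal IP address rather than the compute address resource URL
        networking.gke.io/static-ip: "34.102.201.47"
    spec:
      template:
        spec:
          backend:
            serviceName: asm-ingress-mcs
            servicePort: 80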