
transport: Error while dialing dial tcp xx.xx.xx.xx:15012: i/o timeout with AWS-EKS + Terraform + Istio


I set up (what I think is) a bog-standard EKS cluster using terraform-aws-eks like so:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "my-test-cluster"
  cluster_version = "1.21"

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {}
    vpc-cni = {
      resolve_conflicts = "OVERWRITE"
    }
  }

  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  eks_managed_node_group_defaults = {
    disk_size      = 50
    instance_types = ["m5.large"]
  }

  eks_managed_node_groups = {
    green_test = {
      min_size     = 1
      max_size     = 2
      desired_size = 2

      instance_types = ["t3.large"]
      capacity_type  = "SPOT"
    }
  }
}
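As a sanity check before installing anything, it helps to confirm kubectl is actually pointed at the new cluster. A minimal sketch, assuming the cluster name from the module above and a placeholder region:

```shell
# Point kubectl at the freshly created cluster (cluster name comes from the
# module above; the region is a placeholder assumption).
aws eks update-kubeconfig --name my-test-cluster --region us-east-1
# Both nodes from the managed node group should show up as Ready.
kubectl get nodes -o wide || echo "cluster not reachable yet"
```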

Then I tried to install Istio following the install docs:

istioctl install

which resulted in this:

✔ Istio core installed
✔ Istiod installed
✘ Ingress gateways encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition
  Deployment/istio-system/istio-ingressgateway (containers with unready status: [istio-proxy])
- Pruning removed resources
Error: failed to install manifests: errors occurred during operation

so I did a bit of digging:

kubectl logs istio-ingressgateway-7fd568fc99-6ql8h -n istio-system

led to

2022-04-17T13:51:14.540346Z warn    ca  ca request failed, starting attempt 1 in 90.275446ms
2022-04-17T13:51:14.631695Z warn    ca  ca request failed, starting attempt 2 in 195.118437ms
2022-04-17T13:51:14.827286Z warn    ca  ca request failed, starting attempt 3 in 394.627125ms
2022-04-17T13:51:15.222738Z warn    ca  ca request failed, starting attempt 4 in 816.437569ms
2022-04-17T13:51:16.039427Z warn    sds failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:51:33.941084Z warning envoy config    StreamAggregatedResources gRPC config stream closed since 318s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:52:05.830859Z warning envoy config    StreamAggregatedResources gRPC config stream closed since 350s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:52:26.232441Z warning envoy config    StreamAggregatedResources gRPC config stream closed since 370s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"

So, from a lot of reading, it seems like the istio-ingressgateway pod is not able to connect to istiod.

Google time, I find this: https://istio.io/latest/docs/ops/diagnostic-tools/proxy-cmd/#verifying-connectivity-to-istiod

kubectl create namespace foo
kubectl apply -f <(istioctl kube-inject -f samples/sleep/sleep.yaml) -n foo

kubectl exec $(kubectl get pod -l app=sleep -n foo -o jsonpath={.items..metadata.name}) -c sleep -n foo -- curl -sS istiod.istio-system:15014/version

which gives me:

curl: (7) Failed to connect to istiod.istio-system port 15014 after 4 ms: Connection refused
command terminated with exit code 7

So I think this problem is not specific to the istio-ingressgateway, but a more general networking issue in a standard EKS cluster?
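One way to narrow this down further, a sketch assuming the sleep pod created above is still running in foo and that the sample image's curl build supports the telnet:// scheme: probe istiod's xDS port over raw TCP and compare the failure mode.

```shell
# Resolve istiod's ClusterIP, then open a raw TCP connection to the xDS port.
# curl's telnet:// scheme just dials the socket; --max-time bounds the wait.
ISTIOD_IP=$(kubectl get svc istiod -n istio-system -o jsonpath='{.spec.clusterIP}')
kubectl exec deploy/sleep -n foo -- curl -sS --max-time 3 "telnet://${ISTIOD_IP}:15012" </dev/null \
  && echo "15012 reachable" \
  || echo "15012 unreachable: a timeout points at packet filtering, e.g. security groups; refused points at the service itself"
```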

  1. How would I go about debugging from here, to figure out what the problem is? Are there good resources to understand the networking model of kubernetes and istio?
  2. How come the istio platform docs leave off EKS? Does the istio team not want istio to run on AWS-EKS?
  3. Does this seem like an issue that should be filed against EKS? The aws-eks Terraform module? Istio? I'm not sure exactly where it lands, and it seems if I ask for help from one team, another team would almost certainly need to be involved.
  4. Are there known incompatibilities with Istio and EKS that I should be aware of?

Thanks in advance!

[22-04-18] Update 1:

Ok, so the test with the sleep pod in the foo namespace leads me to believe the connection timeout has to do with AWS security group rules. The theory: if security group ports are not open, you'd see exactly the sort of "connection refused" / "i/o timeout" messages that I see. To test the theory I took the four security groups created by this module

  1. k8s/EKS/Amazon SG
  2. EKS ENI SG
  3. EKS Cluster SG
  4. EKS Shared node group SG

and opened all traffic up inbound/outbound on all of them.
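For reference, the "open everything" experiment can be expressed with the AWS CLI. This is a debugging-only sketch with a placeholder group ID, not something to leave in place:

```shell
SG_ID=sg-0123456789abcdef0  # placeholder: repeat for each of the four groups
# IpProtocol=-1 means all protocols and ports; '|| true' ignores
# "rule already exists" errors on re-runs.
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
  --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]' || true
aws ec2 authorize-security-group-egress --group-id "$SG_ID" \
  --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]' || true
```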

istioctl install
This will install the Istio 1.13.2 default profile with ["Istio core" "Istiod" "Ingress gateways"] components into the cluster. Proceed? (y/N) y
✔ Istio core installed
✔ Istiod installed
✔ Ingress gateways installed
✔ Installation complete
Making this installation the default for injection and validation.

Et voilà! Ok, now I think I need to work backwards and isolate which ports are needed, which security group(s) to apply them to, and whether they belong on the inbound or outbound side. Once I have those, I can PR it back to terraform-aws-eks and save someone else hours of headache.
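The working-backwards step can be sketched as a probe loop. Assumptions: the sleep pod from earlier is still running in foo, its curl supports telnet://, and the port list is only a starting guess taken from Istio's ports documentation:

```shell
# After re-tightening the security groups, probe each candidate Istio port
# against istiod's ClusterIP to see which rules actually matter.
ISTIOD_IP=$(kubectl get svc istiod -n istio-system -o jsonpath='{.spec.clusterIP}')
for port in 15010 15012 15014 15017; do
  if kubectl exec deploy/sleep -n foo -- curl -sS --max-time 3 "telnet://${ISTIOD_IP}:${port}" </dev/null; then
    echo "port ${port}: open"
  else
    echo "port ${port}: blocked"
  fi
done
```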

[22-04-22] Update 2:

Ultimately, I solved this issue, but ran into one more very common problem along the way, one that many others have hit and answered, just not in a form usable with the terraform-aws-eks module.

After I was able to get the istioctl install to work correctly:

istioctl install --set profile=demo
✔ Istio core installed
✔ Istiod installed
✔ Ingress gateways installed
✔ Installation complete
Making this installation the default for injection and validation.

kubectl label namespace default istio-injection=enabled

kubectl apply -f istio-1.13.2/samples/bookinfo/platform/kube/bookinfo.yaml

I saw all the bookinfo pods/deployments fail to start with this:

Internal error occurred: failed calling 
webhook "namespace.sidecar-injector.istio.io": failed to 
call webhook: Post "https://istiod.istio-system.svc:443
/inject?timeout=10s": context deadline exceeded

The answer to this problem is similar to the original one: firewall ports / security group rules. I've added a separate answer below for clarity; it contains a complete working solution for AWS-EKS + Terraform + Istio.


Solution

  • BLUF: Installing Istio on an EKS cluster created with terraform-aws-eks requires additional security group rules before it will install and inject correctly. You need:

    1. To add security group rules (ingress/egress) within the shared node security group, opening the Istio ports so that Istio installs correctly
    2. To add one ingress rule on the node security group, sourced from the control plane (EKS) security group, for port 15017, to resolve the failed calling webhook "namespace.sidecar-injector.istio.io" error

    Unfortunately, I still don't know exactly why this works, since I don't yet understand the order of operations when an Istio-injected pod comes up in a Kubernetes cluster, and who talks to whom.
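One piece of the puzzle can be inspected directly: the API server (which lives in the EKS-managed control plane, behind the cluster security group) is the caller of istiod's injection webhook, and istiod's Service maps port 443 to containerPort 15017 on the node, which is exactly what the extra ingress rule opens. A sketch to see the registration:

```shell
# The mutating webhook tells the API server to POST pod specs to istiod:443,
# which the istiod Service forwards to containerPort 15017 on the node.
kubectl get mutatingwebhookconfiguration istio-sidecar-injector \
  -o jsonpath='{.webhooks[0].clientConfig.service}' || echo "webhook config not found"
echo
kubectl get svc istiod -n istio-system -o jsonpath='{.spec.ports}' || echo "istiod service not found"
echo
```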

    Research resources

    1. A diagram of the security group architecture for an EKS cluster created by terraform-aws-eks
    2. The ports Istio needs open
    3. A youtube video explaining CNI
    4. The ports Kubernetes uses

    Working Example

    Please see the comments for which set of rules solves which of the two problems described above

    # Ports needed to correctly install Istio for the error message:
    # transport: Error while dialing dial tcp xx.xx.xx.xx:15012: i/o timeout
    locals {
      istio_ports = [
        {
          description = "Envoy admin port / outbound"
          from_port   = 15000
          to_port     = 15001
        },
        {
          description = "Debug port"
          from_port   = 15004
          to_port     = 15004
        },
        {
          description = "Envoy inbound"
          from_port   = 15006
          to_port     = 15006
        },
        {
          description = "HBONE mTLS tunnel port / secure networks XDS and CA services (Plaintext)"
          from_port   = 15008
          to_port     = 15010
        },
        {
          description = "XDS and CA services (TLS and mTLS)"
          from_port   = 15012
          to_port     = 15012
        },
        {
          description = "Control plane monitoring"
          from_port   = 15014
          to_port     = 15014
        },
        {
          description = "Webhook container port, forwarded from 443"
          from_port   = 15017
          to_port     = 15017
        },
        {
          description = "Merged Prometheus telemetry from Istio agent, Envoy, and application, Health checks"
          from_port   = 15020
          to_port     = 15021
        },
        {
          description = "DNS port"
          from_port   = 15053
          to_port     = 15053
        },
        {
          description = "Envoy Prometheus telemetry"
          from_port   = 15090
          to_port     = 15090
        },
        {
          description = "aws-load-balancer-controller"
          from_port   = 9443
          to_port     = 9443
        }
      ]
    
      ingress_rules = {
        for ikey, ivalue in local.istio_ports :
        "${ikey}_ingress" => {
          description = ivalue.description
          protocol    = "tcp"
          from_port   = ivalue.from_port
          to_port     = ivalue.to_port
          type        = "ingress"
          self        = true
        }
      }
    
      egress_rules = {
        for ekey, evalue in local.istio_ports :
        "${ekey}_egress" => {
          description = evalue.description
          protocol    = "tcp"
          from_port   = evalue.from_port
          to_port     = evalue.to_port
          type        = "egress"
          self        = true
        }
      }
    }
    
    # The AWS-EKS Module definition
    module "eks" {
      source  = "terraform-aws-modules/eks/aws"
      version = "~> 18.0"
    
      cluster_name    = "my-test-cluster"
      cluster_version = "1.21"
    
      cluster_endpoint_private_access = true
      cluster_endpoint_public_access  = true
    
      cluster_addons = {
        coredns = {
          resolve_conflicts = "OVERWRITE"
        }
        kube-proxy = {}
        vpc-cni = {
          resolve_conflicts = "OVERWRITE"
        }
      }
    
      vpc_id     = var.vpc_id
      subnet_ids = var.subnet_ids
    
      eks_managed_node_group_defaults = {
        disk_size      = 50
        instance_types = ["m5.large"]
      }
    
      # IMPORTANT
      node_security_group_additional_rules = merge(
        local.ingress_rules,
        local.egress_rules
      )
    
      eks_managed_node_groups = {
        green_test = {
          min_size     = 1
          max_size     = 2
          desired_size = 2
    
          instance_types = ["t3.large"]
          capacity_type  = "SPOT"
        }
      }
    }
    
    # Port needed to solve the error:
    # Internal error occurred: failed calling
    # webhook "namespace.sidecar-injector.istio.io": failed to
    # call webhook: Post "https://istiod.istio-system.svc:443/inject?timeout=10s":
    # context deadline exceeded
    resource "aws_security_group_rule" "allow_sidecar_injection" {
      description = "Webhook container port, From Control Plane"
      protocol    = "tcp"
      type        = "ingress"
      from_port   = 15017
      to_port     = 15017
    
      security_group_id        = module.eks.node_security_group_id
      source_security_group_id = module.eks.cluster_primary_security_group_id
    }
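Usage, end to end (a sketch; the commands mirror the ones from the question):

```shell
# Apply the security group rules together with the cluster definition, then
# reinstall Istio and confirm both problems are gone.
terraform apply
istioctl install --set profile=demo -y
kubectl label namespace default istio-injection=enabled --overwrite
# istiod and istio-ingressgateway should reach Running, and bookinfo pods
# should now get their sidecars injected without the webhook timeout.
kubectl get pods -n istio-system || echo "cluster not reachable"
```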
    
    

    Please excuse my possibly terrible Terraform syntax usage. Happy Kuberneteing!