I set up (what I think is) a bog-standard EKS cluster using terraform-aws-eks, like so:
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 18.0"
cluster_name = "my-test-cluster"
cluster_version = "1.21"
cluster_endpoint_private_access = true
cluster_endpoint_public_access = true
cluster_addons = {
coredns = {
resolve_conflicts = "OVERWRITE"
}
kube-proxy = {}
vpc-cni = {
resolve_conflicts = "OVERWRITE"
}
}
vpc_id = var.vpc_id
subnet_ids = var.subnet_ids
eks_managed_node_group_defaults = {
disk_size = 50
instance_types = ["m5.large"]
}
eks_managed_node_groups = {
green_test = {
min_size = 1
max_size = 2
desired_size = 2
instance_types = ["t3.large"]
capacity_type = "SPOT"
}
}
}
Then I tried to install Istio per the install docs:
istioctl install
which resulted in this:
✔ Istio core installed
✔ Istiod installed
✘ Ingress gateways encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition
Deployment/istio-system/istio-ingressgateway (containers with unready status: [istio-proxy])
- Pruning removed resources Error: failed to install manifests: errors occurred during operation
So I did a bit of digging:
kubectl logs istio-ingressgateway-7fd568fc99-6ql8h -n istio-system
which led to:
2022-04-17T13:51:14.540346Z warn ca ca request failed, starting attempt 1 in 90.275446ms
2022-04-17T13:51:14.631695Z warn ca ca request failed, starting attempt 2 in 195.118437ms
2022-04-17T13:51:14.827286Z warn ca ca request failed, starting attempt 3 in 394.627125ms
2022-04-17T13:51:15.222738Z warn ca ca request failed, starting attempt 4 in 816.437569ms
2022-04-17T13:51:16.039427Z warn sds failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:51:33.941084Z warning envoy config StreamAggregatedResources gRPC config stream closed since 318s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:52:05.830859Z warning envoy config StreamAggregatedResources gRPC config stream closed since 350s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
2022-04-17T13:52:26.232441Z warning envoy config StreamAggregatedResources gRPC config stream closed since 370s ago: 14, connection error: desc = "transport: Error while dialing dial tcp 172.20.55.247:15012: i/o timeout"
From a lot of reading, it seems like the istio-ingressgateway pod is not able to connect to istiod.
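(The IP in those logs, 172.20.55.247, should be the ClusterIP of the istiod service; a quick sanity check, assuming a default install:)
kubectl get svc istiod -n istio-system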
Google time, I find this: https://istio.io/latest/docs/ops/diagnostic-tools/proxy-cmd/#verifying-connectivity-to-istiod
kubectl create namespace foo
kubectl apply -f <(istioctl kube-inject -f samples/sleep/sleep.yaml) -n foo
kubectl exec $(kubectl get pod -l app=sleep -n foo -o jsonpath={.items..metadata.name}) -c sleep -n foo -- curl -sS istiod.istio-system:15014/version
which gives me:
curl: (7) Failed to connect to istiod.istio-system port 15014 after 4 ms: Connection refused
command terminated with exit code 7
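To rule out a Service/DNS-only issue, one can also curl an istiod pod IP directly, since port 15014 serves the same plain-HTTP endpoints (replace the <ISTIOD_POD_IP> placeholder with whatever the first command reports):
kubectl get pod -n istio-system -l app=istiod -o wide
kubectl exec $(kubectl get pod -l app=sleep -n foo -o jsonpath={.items..metadata.name}) -c sleep -n foo -- curl -sS <ISTIOD_POD_IP>:15014/version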
So I think this problem is not specific to the istio-ingressgateway, but rather a more general networking issue in a standard EKS cluster?
Thanks in advance!
[22-04-18] Update 1:
Ok, so the test with the sleep pod in the foo namespace leads me to believe the connection timeout has to do with AWS security group rules. The theory: if security group ports are not opened, you'd see exactly the sort of "connection refused" / "i/o timeout" messages I'm seeing. To test the theory, I took the four security groups created by this module and opened up all traffic, inbound and outbound, on all of them.
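From the CLI, the brute-force version looks something like this (the security group IDs are placeholders for the four groups the module creates; this is for testing only, not something to leave in place):
for SG in sg-aaaa1111 sg-bbbb2222 sg-cccc3333 sg-dddd4444; do
  aws ec2 authorize-security-group-ingress --group-id "$SG" --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'
  aws ec2 authorize-security-group-egress --group-id "$SG" --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'
done
With everything wide open, the install went through: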
istioctl install
This will install the Istio 1.13.2 default profile with ["Istio core" "Istiod" "Ingress gateways"] components into the cluster. Proceed? (y/N) y
✔ Istio core installed
✔ Istiod installed
✔ Ingress gateways installed
✔ Installation complete Making this installation the default for injection and validation.
Et voilà! Ok, now I think I need to work backwards and isolate which ports need opening, which security group they belong to, and whether they go on the inbound or outbound side. Once I have those, I can PR it back to terraform-aws-eks and save someone else hours of headache.
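For anyone following along: after each rule change, the quickest retest is to restart the gateway deployment and watch whether istio-proxy goes ready:
kubectl -n istio-system rollout restart deployment/istio-ingressgateway
kubectl -n istio-system get pods -w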
[22-04-22] Update 2:
Ultimately, I solved this issue, but I ran into one more Very Common problem, one that many others have run into and answered, though not in a format usable with the terraform-aws-eks module.
After I was able to get the istioctl install to work correctly:
istioctl install --set profile=demo
✔ Istio core installed
✔ Istiod installed
✔ Ingress gateways installed
✔ Installation complete Making this installation the default for injection and validation.
kubectl label namespace default istio-injection=enabled
kubectl apply -f istio-1.13.2/samples/bookinfo/platform/kube/bookinfo.yaml
I saw all the bookinfo pods/deployments fail to start with this:
Internal error occurred: failed calling
webhook "namespace.sidecar-injector.istio.io": failed to
call webhook: Post "https://istiod.istio-system.svc:443
/inject?timeout=10s": context deadline exceeded
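As a diagnostic aside: the failing webhook is registered in a MutatingWebhookConfiguration (named istio-sidecar-injector for a default, non-revisioned install; adjust if yours differs), and you can inspect who it points at:
kubectl get mutatingwebhookconfiguration istio-sidecar-injector -o yaml
Its clientConfig targets the istiod service on port 443, which forwards to container port 15017 on the istiod pods, and the caller is the EKS-managed control plane; that is exactly the path the security group rule at the end of this answer opens.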
The answer to this problem is similar to the original one: firewall ports / security group rules. I've added a separate answer below for clarity; it contains a complete working solution for AWS EKS + Terraform + Istio.
BLUF: Installing Istio on terraform-aws-eks requires you to add security group rules allowing communication within the node group. You need one set of rules to fix the transport: Error while dialing dial tcp ...:15012: i/o timeout error, and one more rule to fix the failed calling webhook "namespace.sidecar-injector.istio.io" error. Unfortunately, I still don't know why this works, since I don't yet understand the order of operations when an Istio-injected pod comes up in a Kubernetes cluster, and who tries to talk to whom. Please see the comments in the code below for which set of rules solves which of the two problems from the original question.
# Ports needed to correctly install Istio, for the error message:
# transport: Error while dialing dial tcp xx.xx.xx.xx:15012: i/o timeout
locals {
  istio_ports = [
    {
      description = "Envoy admin port / outbound"
      from_port   = 15000
      to_port     = 15001
    },
    {
      description = "Debug port"
      from_port   = 15004
      to_port     = 15004
    },
    {
      description = "Envoy inbound"
      from_port   = 15006
      to_port     = 15006
    },
    {
      description = "HBONE mTLS tunnel port / secure networks XDS and CA services (Plaintext)"
      from_port   = 15008
      to_port     = 15010
    },
    {
      description = "XDS and CA services (TLS and mTLS)"
      from_port   = 15012
      to_port     = 15012
    },
    {
      description = "Control plane monitoring"
      from_port   = 15014
      to_port     = 15014
    },
    {
      description = "Webhook container port, forwarded from 443"
      from_port   = 15017
      to_port     = 15017
    },
    {
      description = "Merged Prometheus telemetry from Istio agent, Envoy, and application, Health checks"
      from_port   = 15020
      to_port     = 15021
    },
    {
      description = "DNS port"
      from_port   = 15053
      to_port     = 15053
    },
    {
      description = "Envoy Prometheus telemetry"
      from_port   = 15090
      to_port     = 15090
    },
    {
      description = "aws-load-balancer-controller"
      from_port   = 9443
      to_port     = 9443
    }
  ]

  ingress_rules = {
    for ikey, ivalue in local.istio_ports :
    "${ikey}_ingress" => {
      description = ivalue.description
      protocol    = "tcp"
      from_port   = ivalue.from_port
      to_port     = ivalue.to_port
      type        = "ingress"
      self        = true
    }
  }

  egress_rules = {
    for ekey, evalue in local.istio_ports :
    "${ekey}_egress" => {
      description = evalue.description
      protocol    = "tcp"
      from_port   = evalue.from_port
      to_port     = evalue.to_port
      type        = "egress"
      self        = true
    }
  }
}
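# For reference: via node_security_group_additional_rules below, the module
# expands each entry above into a rule on the node security group equivalent
# to this standalone resource (illustrative only; the module creates it):
#
# resource "aws_security_group_rule" "xds_ingress_example" {
#   description       = "XDS and CA services (TLS and mTLS)"
#   type              = "ingress"
#   protocol          = "tcp"
#   from_port         = 15012
#   to_port           = 15012
#   self              = true
#   security_group_id = module.eks.node_security_group_id
# }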
# The AWS-EKS module definition
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "my-test-cluster"
  cluster_version = "1.21"

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  cluster_addons = {
    coredns = {
      resolve_conflicts = "OVERWRITE"
    }
    kube-proxy = {}
    vpc-cni = {
      resolve_conflicts = "OVERWRITE"
    }
  }

  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  eks_managed_node_group_defaults = {
    disk_size      = 50
    instance_types = ["m5.large"]
  }

  # IMPORTANT
  node_security_group_additional_rules = merge(
    local.ingress_rules,
    local.egress_rules
  )

  eks_managed_node_groups = {
    green_test = {
      min_size     = 1
      max_size     = 2
      desired_size = 2

      instance_types = ["t3.large"]
      capacity_type  = "SPOT"
    }
  }
}
# Port needed to solve the error:
# Internal error occurred: failed calling
# webhook "namespace.sidecar-injector.istio.io": failed to
# call webhook: Post "https://istiod.istio-system.svc:443/inject?timeout=10s":
# context deadline exceeded
resource "aws_security_group_rule" "allow_sidecar_injection" {
description = "Webhook container port, From Control Plane"
protocol = "tcp"
type = "ingress"
from_port = 15017
to_port = 15017
security_group_id = module.eks.node_security_group_id
source_security_group_id = module.eks.cluster_primary_security_group_id
}
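With that rule in place, re-applying bookinfo should bring every pod up with the sidecar injected (2/2 containers ready):
kubectl delete -f istio-1.13.2/samples/bookinfo/platform/kube/bookinfo.yaml
kubectl apply -f istio-1.13.2/samples/bookinfo/platform/kube/bookinfo.yaml
kubectl get pods -w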
Please excuse my possibly terrible Terraform syntax. Happy Kuberneting!