
Creating an Amazon EKS cluster with jenkins-x and cluster-autoscaler: ingress fails on an even number of nodes

I am creating an Amazon EKS cluster using jenkins-x with:

jx create cluster eks -n demo --node-type=t3.xlarge --nodes=1 --nodes-max=5 --nodes-min=1 --skip-installation

After that, I add the cluster-autoscaler IAM policy for auto-discovery and add the tags to the autoscaling group and the created instance, according to this guide.

I add the RBAC roles for tiller and the autoscaler with this file (kubectl create -f rbac-config.yaml):
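
A minimal sketch of the tagging step, assuming the AWS CLI is installed and configured. The Auto Scaling group name my-eks-worker-asg is a hypothetical placeholder, and the cluster name is taken from the -n demo flag above. cluster-autoscaler auto-discovery matches on tag keys, so the tag values used here are illustrative. The script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: tag an Auto Scaling group so cluster-autoscaler auto-discovery can find it.
# ASG_NAME is a hypothetical placeholder; CLUSTER_NAME matches `-n demo` above.
ASG_NAME="${ASG_NAME:-my-eks-worker-asg}"
CLUSTER_NAME="${CLUSTER_NAME:-demo}"

# Build one --tags argument in the AWS CLI shorthand syntax.
asg_tag() {
  printf 'ResourceId=%s,ResourceType=auto-scaling-group,Key=%s,Value=%s,PropagateAtLaunch=true' \
    "$ASG_NAME" "$1" "$2"
}

# Print the command by default; set DRY_RUN=0 to actually call the AWS CLI.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

run aws autoscaling create-or-update-tags --tags \
  "$(asg_tag k8s.io/cluster-autoscaler/enabled true)" \
  "$(asg_tag "k8s.io/cluster-autoscaler/$CLUSTER_NAME" true)"
```

PropagateAtLaunch=true makes newly launched instances inherit the tags; instances that already exist would need to be tagged separately.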

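apiVersion: v1
kind: ServiceAccount
metadata:
  name: tiller
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tiller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: tiller
    namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: autoscaler
    namespace: kube-system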

I installed tiller:

helm init --service-account tiller

and installed the cluster autoscaler:

helm install stable/cluster-autoscaler -f cluster-autoscaler-values.yaml --name cluster-autoscaler --namespace kube-system
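
The contents of cluster-autoscaler-values.yaml are not shown in the post. As a rough sketch only, a values file for auto-discovery with the stable/cluster-autoscaler chart might look like the one generated below. The region us-east-1 is a hypothetical placeholder, and reusing the autoscaler service account from rbac-config.yaml (rather than letting the chart create its own) is an assumption.

```shell
#!/bin/sh
# Sketch: write a minimal values file for the stable/cluster-autoscaler chart.
# AWS_REGION is a hypothetical placeholder; the real file is not shown in the post.
OUT="${OUT:-cluster-autoscaler-values.yaml}"
AWS_REGION="${AWS_REGION:-us-east-1}"
CLUSTER_NAME="${CLUSTER_NAME:-demo}"

# Unquoted heredoc delimiter so the shell expands the variables above.
cat > "$OUT" <<EOF
cloudProvider: aws
awsRegion: ${AWS_REGION}
autoDiscovery:
  clusterName: ${CLUSTER_NAME}
rbac:
  create: false
  serviceAccountName: autoscaler
EOF

# Show the generated file for review before passing it to helm with -f.
cat "$OUT"
```

With autoDiscovery.clusterName set, the chart relies on the Auto Scaling group tags rather than an explicit list of group names.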

Then I install the jenkins-x system:

jx install --provider=eks --default-environment-prefix=demo --skip-setup-tiller

I just accept all the defaults on the questions (nginx-ingress is created for me).

Then I create a default spring-boot-rest-prometheus app:

jx create quickstart

again accepting all the defaults. This works fine: the application is picked up by Jenkins and compiled, which I can see in:

and I can reach the app through:

Then I run a test to see whether the autoscaler is working correctly: I open charts/spring-boot-rest-prometheus/values.yaml, change replicaCount: 1 to replicaCount: 8, then commit and push. This kicks off the Jenkins pipeline and spins up a new node, because the autoscaler sees that there are not enough CPU resources on the first node.

After the second node has come up, I cannot reach Jenkins and the app anymore via the domain names. So for some reason, my ingress is not working anymore.

I have played around with this a lot, manually changing the desired number of nodes directly on EC2. When there is an even number of nodes, the domains are not reachable; when there is an odd number of nodes, they are reachable.

I do not think this is related to the autoscaler, because the scale up and the scale down are working fine, and the problem is also there if I manually change the desired nodes of the server.

What causes the ingress to fail for an even number of nodes? How can I investigate this issue further?

Logs and descriptors for all ingress parts are posted here.


  • FWIW, I seem to have run into this issue:

    Still checking with AWS Support if that's the case for EKS also, but it seems very plausible.