kubernetes-helm, kubernetes-ingress, amazon-eks, aws-fargate, ingress-nginx

crashing ingress-nginx controller on Fargate-only EKS cluster due to bind() to 0.0.0.0:8443 failed (98: Address in use)


The ingress-nginx pod I have helm-installed into my EKS cluster is perpetually failing, its logs indicating that the application cannot bind to 0.0.0.0:8443 (INADDR_ANY:8443). I have confirmed that 0.0.0.0:8443 is indeed already bound in the container, but because I don't yet have root access to the container I've been unable to identify the culprit process/user.

I have created this issue on the kubernetes ingress-nginx project that I'm using, but also wanted to reach out to the wider SO community for insights, solutions, and troubleshooting suggestions on how to get past this hurdle.

As I'm a newcomer to both AWS/EKS and Kubernetes, it is likely that some environment configuration error is causing this issue. For example, could this be caused by a misconfigured AWS-ism such as the VPC (its subnets or security groups)? Thank you in advance for your help!

The linked GitHub issue provides copious details about the Terraform-provisioned EKS environment as well as the Helm-installed deployment of ingress-nginx. Here are some key details:

  1. The EKS cluster is configured to only use Fargate workers, and has 3 public and 3 private subnets, all 6 of which are made available to the cluster and each of its Fargate profiles.
  2. It should also be noted that the cluster is new; the ingress-nginx pod is the first attempt to deploy anything to it, aside from kube-system items like coredns, which has been configured to run on Fargate (which required manually removing the default ec2 compute-type annotation, as described here).
  3. There are 6 Fargate profiles, but only 2 are currently in use: coredns and ingress. These are dedicated to kube-system/kube-dns and ingress-nginx, respectively. Other than the selectors' namespaces and labels, there is nothing "custom" about the profile specifications. It has been confirmed that the selectors are working for both coredns and ingress, i.e. the ingress pods are scheduled to run, but failing.
  4. The reason ingress-nginx is using port 8443 is that I first ran into this Privilege Escalation issue, whose workaround requires disabling allowPrivilegeEscalation and changing ports from privileged to unprivileged ones. I'm invoking helm install with the following values (a sketch of the full install command appears after this list):
controller:
  extraArgs:
    http-port: 8080
    https-port: 8443
  containerPort:
    http: 8080
    https: 8443
  image:
    allowPrivilegeEscalation: false
  # https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes
  livenessProbe:
    initialDelaySeconds: 60  # 30
  readinessProbe:
    initialDelaySeconds: 60  # 0
  service:
    ports:
      http: 80
      https: 443
    targetPorts:
      http: 8080
      https: 8443
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
  5. As my original observation (before I looked at the logs) was that the K8s liveness/readiness probes were failing/timing out, I first experimented with extending their initialDelaySeconds in the values passed to helm install. But eventually I looked at the pod/container logs and found that, regardless of the probe settings, every time I reinstall the ingress-nginx pod and wait a bit, the logs indicate the same bind error reported here:
```
2021/11/12 17:15:02 [emerg] 27#27: bind() to [::]:8443 failed (98: Address in use)
.
.
```
  6. Aside from what I've noted above, I haven't intentionally configured anything "non-stock". I'm a bit lost in AWS/K8s's sea of configuration, looking for what piece I need to adapt/correct.
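For completeness, here is a rough sketch of how the install itself is invoked, with the values from item 4 saved to a `values.yaml` file. The release name `nginx-ingress` and the `ingress` namespace are inferred from the pod names and `--publish-service` flag shown in the output below; the chart repository URL is the one published by the ingress-nginx project:

```
# Add the official ingress-nginx chart repository and refresh the local index
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# Install the controller into the "ingress" namespace with the values shown above.
# The release name "nginx-ingress" matches the controller pod names seen later.
helm install nginx-ingress ingress-nginx/ingress-nginx \
  --namespace ingress --create-namespace \
  --values values.yaml
```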

Do you have clues or guesses why INADDR_ANY, port 8443 would already be bound in my (fairly-standard) `nginx-ingress-ingress-nginx-controller` pod/container?

As I alluded to earlier, I am able to execute the `netstat` command inside the running container as the default user `www-data` to confirm that 0.0.0.0:8443 is indeed already bound, but because I haven't yet figured out how to get root access, the PID/name of the owning process is not available to me:

```
> kubectl exec -n ingress --stdin --tty nginx-ingress-ingress-nginx-controller-74d46b8fd8-85tkh -- netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:10245         0.0.0.0:*               LISTEN      -
tcp        3      0 127.0.0.1:10246         0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:10247         0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:8181            0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:8181            0.0.0.0:*               LISTEN      -
tcp        0      0 :::8443                 :::*                    LISTEN      -
tcp        0      0 :::10254                :::*                    LISTEN      -
tcp        0      0 :::8080                 :::*                    LISTEN      -
tcp        0      0 :::8080                 :::*                    LISTEN      -
tcp        0      0 :::8181                 :::*                    LISTEN      -
tcp        0      0 :::8181                 :::*                    LISTEN      -
```

```
> kubectl exec -n ingress --stdin --tty nginx-ingress-ingress-nginx-controller-74d46b8fd8-85tkh -- /bin/bash
bash-5.1$ whoami
www-data
bash-5.1$ ps aux
PID   USER     TIME  COMMAND
    1 www-data  0:00 /usr/bin/dumb-init -- /nginx-ingress-controller --publish-service=ingress/nginx-ingress-ingress-nginx-controller --election-id=ingress-controller-leader --controller-class=k8s.io/ingress-nginx
    8 www-data  0:00 /nginx-ingress-controller --publish-service=ingress/nginx-ingress-ingress-nginx-controller --election-id=ingress-controller-leader --controller-class=k8s.io/ingress-nginx --configmap=ingress/n
   28 www-data  0:00 nginx: master process /usr/local/nginx/sbin/nginx -c /etc/nginx/nginx.conf
   30 www-data  0:00 nginx: worker process
   45 www-data  0:00 /bin/bash
   56 www-data  0:00 ps aux
```

I'm currently looking into how to get root access to my Fargate containers (without mucking about with their Dockerfiles to install ssh...) so I can get more insight into which process/user is binding INADDR_ANY:8443 in this pod/container.
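In the meantime, a non-root workaround for mapping the 8443 listener to a PID is to read `/proc` directly: look up the socket inode for port 8443 (hex `20FB`) in `/proc/net/tcp6`, then search each process's open file descriptors for that inode. Since every process in this container runs as `www-data` (per the `ps aux` output above), their `fd` directories should be readable without root. A rough sketch, assuming the busybox `awk`/`grep`/`ls` utilities are present in the controller image:

```
# Run inside the container shell opened with the kubectl exec above.
# 8443 decimal == 20FB hex; state 0A == LISTEN; the socket inode is field 10 of /proc/net/tcp6.
INODE=$(awk '$2 ~ /:20FB$/ && $4 == "0A" {print $10; exit}' /proc/net/tcp6)
echo "socket inode listening on [::]:8443 is ${INODE}"

# Walk every process and report which one holds that socket inode as an open fd.
for pid in /proc/[0-9]*; do
  if ls -l "${pid}/fd" 2>/dev/null | grep -q "socket:\[${INODE}\]"; then
    echo "bound by PID ${pid#/proc/} ($(cat "${pid}/comm"))"
  fi
done
```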

Solution

  • Posted a community wiki answer based on this same topic and this similar issue (both on GitHub). Feel free to expand it.


    The answer from GitHub:

    The problem is that 8443 is already bound for the webhook. That's why I used 8081 in my suggestion, not 8443. The examples using 8443 here had to also move the webhook, which introduces more complexity to the changes, and can lead to weird issues if you get it wrong.

    An example using port 8081:

    As well as those settings, you'll also need to use the appropriate annotations to run using an NLB rather than an ELB, so all-up it ends up looking something like:

    controller:
      extraArgs:
        http-port: 8080
        https-port: 8081

      containerPort:
        http: 8080
        https: 8081

      image:
        allowPrivilegeEscalation: false

      service:
        annotations:
          service.beta.kubernetes.io/aws-load-balancer-type: "nlb-ip"


    Edit: Fixed the aws-load-balancer-type to be nlb-ip, as that's required for Fargate. It probably should be

    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
    

    for current versions of the AWS Load Balancer controller (2.2 onwards), but new versions will recognise the nlb-ip annotation.
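As a follow-up to the edit above: whether to use `nlb-ip` or the newer `external` + `nlb-target-type: ip` pair depends on which version of the AWS Load Balancer Controller is installed in the cluster. One way to check is to read the controller's image tag, assuming it was deployed under the conventional `aws-load-balancer-controller` name in `kube-system` (adjust the name/namespace if your install differs); if the deployment isn't found, the controller may not be installed at all:

```
# Print the image (and therefore the version tag) of the AWS Load Balancer Controller.
# The deployment name and namespace below are the defaults used by its Helm chart.
kubectl -n kube-system get deployment aws-load-balancer-controller \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```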