
Kubernetes nginx ingress controller is unreliable


I need help understanding in detail how an ingress controller, specifically the ingress-nginx ingress controller, is supposed to work. To me, it appears as a black box that is supposed to listen on a public IP, terminate TLS, and forward traffic to a pod. But exactly how that happens is a mystery to me.

The primary goal here is understanding, the secondary goal is troubleshooting an immediate issue I'm facing.

I have a cluster with five nodes, and am trying to get the Jupyterhub application to run on it. For the most part, it is working fine. I'm using a pretty standard Rancher RKE setup with flannel/calico for the networking. The nodes run RedHat 7.9 with iptables and firewalld, and docker 19.03.

The Jupyterhub proxy is set up with a ClusterIP service (I also tried a NodePort service, that also works). I also set up an ingress. The ingress sometimes works, but oftentimes does not respond (connection times out). Specifically, if I delete the ingress, and then redeploy my helm chart, the ingress will start working. Also, if I restart one of my nodes, the ingress will start working again. I have not identified the circumstances when the ingress stops working.
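
One way to test the ingress against a specific node, without DNS or a load balancer in the way, is something like the following (the hostname and node IP are placeholders for my redacted values):

# Send an HTTPS request for the ingress host to one specific node,
# skipping certificate validation (-k) and bypassing DNS (--resolve)
curl -vk --resolve jupyterhub.example.com:443:<node-ip> https://jupyterhub.example.com/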

Here are my relevant services:

kubectl get services
NAME                       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
hub                        ClusterIP   10.32.0.183   <none>        8081/TCP   378d
proxy-api                  ClusterIP   10.32.0.11    <none>        8001/TCP   378d
proxy-public               ClusterIP   10.32.0.30    <none>        80/TCP     378d

This works; telnet 10.32.0.30 80 responds as expected (of course only from one of the nodes). I can also telnet directly to the proxy-public pod (10.244.4.41:8000 in my case).
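
Since a ClusterIP is only reachable from inside the cluster, a throwaway pod is a convenient way to test the service from within; for example (busybox is just one image that works for this):

# Run a temporary pod and fetch the proxy-public service via cluster DNS
kubectl run -n jhub tmp-test --rm -i --image=busybox --restart=Never -- \
  wget -qO- http://proxy-public.jhub.svc.cluster.local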

Here is my ingress.

kubectl describe ingress
Name:             jupyterhub
Labels:           app=jupyterhub
                  app.kubernetes.io/managed-by=Helm
                  chart=jupyterhub-1.2.0
                  component=ingress
                  heritage=Helm
                  release=jhub
Namespace:        jhub
Address:          k8s-node4.<redacted>,k8s-node5.<redacted>
Default backend:  default-http-backend:80 (<error: endpoints "default-http-backend" not found>)
TLS:
  tls-jhub terminates jupyterhub.<redacted>
Rules:
  Host                     Path  Backends
  ----                     ----  --------
  jupyterhub.<redacted>
                           /   proxy-public:http (10.244.4.41:8000)
Annotations:               field.cattle.io/publicEndpoints:
                             [{"addresses":["",""],"port":443,"protocol":"HTTPS","serviceName":"jhub:proxy-public","ingressName":"jhub:jupyterhub","hostname":"jupyterh...
                           meta.helm.sh/release-name: jhub
                           meta.helm.sh/release-namespace: jhub
Events:                    <none>

What I understand so far about the ingress in this situation:

Traffic arrives on port 443 at k8s-node4 or k8s-node5. Some magic (controlled by the ingress controller) receives that traffic, terminates TLS, and sends the unencrypted traffic to the pod's IP at port 8000. That's the part I want to understand better.

That black box seems to at least partially involve flannel/calico and some iptables magic, and it obviously involves nginx at some point.
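
One way to peek inside that black box is to find the controller pods and dump the nginx configuration they generate. The namespace and pod name below are assumptions (RKE normally deploys the controller into the ingress-nginx namespace, but yours may differ):

# Find the ingress controller pods and which nodes they run on
kubectl get pods --all-namespaces -o wide | grep ingress

# Dump the rendered nginx configuration from one controller pod
kubectl -n ingress-nginx exec <nginx-ingress-controller-pod> -- nginx -T | grep -A 5 jupyterhub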

Update: in the meantime, I identified what causes Kubernetes to break: restarting firewalld. As best I can tell, that flushes all iptables rules, not just the firewalld-generated ones.
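
The effect is easy to confirm by counting NAT rules before and after a firewalld restart. kube-proxy periodically re-creates its own KUBE-* chains, but as far as I can tell the CNI hostPort rules only come back when the pods that use the hostPort are recreated; for example (namespace and label are assumptions for an RKE nginx-ingress deployment):

# Count NAT rules before and after restarting firewalld
iptables -t nat -S | wc -l
systemctl restart firewalld
iptables -t nat -S | wc -l    # far fewer rules afterwards

# Recreate the controller pods so the CNI re-adds its hostPort DNAT rules
kubectl -n ingress-nginx delete pod -l app=ingress-nginx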


Solution

  • I found the answer to my question here: https://www.stackrox.io/blog/kubernetes-networking-demystified/ One caveat is that some of this may vary depending on which networking CNI you are using, although everything I saw was generic Kubernetes behavior rather than anything CNI-specific.

    I'm still digesting the content of that blog post, and I highly recommend referring to it directly instead of relying on my answer, which could be a poor retelling of the story.

    Here is approximately how a packet that arrives on port 443 flows.

    You will need to use the following command to see the tables.

    iptables -t nat -vnL | less
    

    The output of this looks rather intimidating.

    The listing below cuts out a lot of other chains and chain calls in order to cut to the chase. In this example:

    • This cluster uses the Canal CNI plugin (Calico combined with Flannel).
    • Listen port is 443
    • The nginx-ingress-controller pod listens (among other addresses) at 10.244.0.183.

    In that situation, here is how the packet flows (the commands after the list let you follow along on a node):

    • The packet comes into the PREROUTING chain.
    • The PREROUTING chain calls (among other things) the CNI-HOSTPORT-DNAT chain.
    • The POSTROUTING chain also calls the same chain.
    • The CNI-HOSTPORT-DNAT chain in turn calls several CNI-DN-xxxx chains.
    • The CNI-DN-xxxx chains perform DNAT and change the destination address to 10.244.0.183.
    • The container inside the nginx-ingress-controller listens on 10.244.0.183.
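
    Each of those hops can be inspected directly on a node; for example (10.244.0.183 is the controller pod IP from the example above):

    # Where does PREROUTING hand traffic to the CNI chains?
    iptables -t nat -vnL PREROUTING

    # Which per-pod CNI-DN-xxxx chains does the hostPort chain reference?
    iptables -t nat -vnL CNI-HOSTPORT-DNAT

    # Find the DNAT rule that rewrites the destination to the controller pod
    iptables -t nat -S | grep 10.244.0.183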

    There is some additional complexity if the pod is on a different node than the one the packet arrived on, and also if multiple pods are load-balanced behind the same port. Load balancing seems to be handled by the iptables statistic module randomly picking one of several rules.
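
    Those load-balancing rules are easy to spot on a node:

    # KUBE-SVC-* chains with more than one endpoint use the statistic match
    iptables -t nat -S | grep statistic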

    Internal traffic from a service to a pod follows a similar, but not identical, flow.

    In this example:

    • The service (hub, from the listing above) is at 10.32.0.183, port 8081.
    • The pod backing it is at 10.244.6.112, port 8081.
    Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
    ...
    KUBE-SERVICES  all  --  *      *       0.0.0.0/0            0.0.0.0/0
    
    Chain KUBE-SERVICES (2 references)
    ...
    /* Traffic from within the cluster to 10.32.0.183:8081 */
    0 0 KUBE-SVC-ZHCKOT5PFJF4PASJ  tcp  --  *      *       0.0.0.0/0            10.32.0.183          tcp dpt:8081
    ...
    
    /* Mark the packet */
    Chain KUBE-SVC-ZHCKOT5PFJF4PASJ (1 references)
     pkts bytes target     prot opt in     out     source               destination
        0     0 KUBE-MARK-MASQ  tcp  --  *      *      !10.244.0.0/16        10.32.0.183  tcp dpt:8081
        0     0 KUBE-SEP-RYU73S2VFHOHW4XO  all  --  *      *       0.0.0.0/0            0.0.0.0/0 
    
    /* Perform DNAT, redirecting from 10.32.0.183 to 10.244.6.112 */
    Chain KUBE-SEP-RYU73S2VFHOHW4XO (1 references)
     pkts bytes target     prot opt in     out     source               destination
        0     0 KUBE-MARK-MASQ  all  --  *      *       10.244.6.112         0.0.0.0/0
        0     0 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0 tcp to:10.244.6.112:8081
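
    To find which KUBE-SVC-* chain belongs to a given service, grep the KUBE-SERVICES chain for its ClusterIP, then list that chain to see its KUBE-SEP-* endpoint chains:

    iptables -t nat -S KUBE-SERVICES | grep 10.32.0.183
    iptables -t nat -vnL KUBE-SVC-ZHCKOT5PFJF4PASJ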
    
    

    As for the second part of my question, how to get the nodes to work reliably:

    • Disable firewalld.
    • Use Kubernetes network policies (or Calico network policies if you are using Calico) instead; a sketch follows below.
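
    For the first point, on RHEL 7 that amounts to systemctl disable --now firewalld. For the second, here is a minimal sketch of a policy that only lets the ingress controller reach the proxy pod on port 8000; the pod label, the namespace label, and the policy name are assumptions and need to be adjusted to whatever labels your cluster actually uses:

    # Sketch: allow only the ingress controller's namespace to reach the
    # JupyterHub proxy pod on port 8000 (labels below are assumptions)
    kubectl apply -n jhub -f - <<'EOF'
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-ingress-to-proxy
    spec:
      podSelector:
        matchLabels:
          component: proxy              # assumed label on the proxy-public pod
      policyTypes:
        - Ingress
      ingress:
        - from:
            - namespaceSelector:
                matchLabels:
                  name: ingress-nginx   # assumed label on the controller namespace
          ports:
            - protocol: TCP
              port: 8000
EOF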