Tags: amazon-ec2, kubernetes, netfilter, flannel, kube-proxy

Kubernetes - Connection tracking does not mangle packets back to the original destination IP (DNAT)


We have a Kubernetes cluster on AWS EC2 instances, created with kops. We are experiencing problems with internal pod communication through Kubernetes services (which load-balance traffic across the destination pods). The problem appears when the source and destination pods are on the same EC2 instance (node). The cluster uses flannel with the vxlan backend for inter-node communication, and services are managed by kube-proxy in iptables mode.
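
As a side note, kube-proxy logs which proxier it is using at startup, so the iptables mode can be double-checked with something along these lines (the pod name is a placeholder for the static kube-proxy pod on the node):

    kubectl -n kube-system logs kube-proxy-<node-name> | grep -i proxier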

In a scenario where:

  • PodA running on EC2 instance 1 (ip-172-20-121-84, us-east-1c): 100.96.54.240
  • PodB running on EC2 instance 1 (ip-172-20-121-84, us-east-1c): 100.96.54.247
  • ServiceB (service where PodB is a possible destination endpoint): 100.67.30.133

If we exec into PodA and run "curl -v http://ServiceB/", no answer is received and the request eventually times out.
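
For reference, reproducing it from outside the pod looks roughly like this (podA is a placeholder for the actual pod name; 100.67.30.133 is the ServiceB cluster IP listed above):

    kubectl -n prod exec podA -- curl -v --max-time 10 http://100.67.30.133/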

When we inspect the traffic (on the cni0 interface of instance 1), we observe:

  1. PodA sends a SYN packet to the ServiceB IP
  2. The packet is mangled and its destination IP is changed from the ServiceB IP to the PodB IP
  3. Conntrack registers that change:

    root@ip-172-20-121-84:/home/admin# conntrack -L|grep 100.67.30.133
    tcp      6 118 SYN_SENT src=100.96.54.240 dst=100.67.30.133 sport=53084 dport=80 [UNREPLIED] src=100.96.54.247 dst=100.96.54.240 sport=80 dport=43534 mark=0 use=1
    
  4. PodB sends a SYN+ACK packet to PodA

  5. The source IP of the SYN+ACK packet is not reverted from the PodB IP back to the ServiceB IP
  6. PodA receives a SYN+ACK packet from PodB, which it did not expect, so it sends back a RST packet
  7. After a timeout, PodA sends a new SYN packet to ServiceB and the whole process repeats

Here are the annotated tcpdump details:

root@ip-172-20-121-84:/home/admin# tcpdump -vv -i cni0 -n "src host 100.96.54.240 or dst host 100.96.54.240"
TCP SYN:
15:26:01.221833 IP (tos 0x0, ttl 64, id 2160, offset 0, flags [DF], proto TCP (6), length 60)
    100.96.54.240.43534 > 100.67.30.133.80: Flags [S], cksum 0x1e47 (incorrect -> 0x3e31), seq 506285654, win 26733, options [mss 8911,sackOK,TS val 153372198 ecr 0,nop,wscale 9], length 0
15:26:01.221866 IP (tos 0x0, ttl 63, id 2160, offset 0, flags [DF], proto TCP (6), length 60)
    100.96.54.240.43534 > 100.96.54.247.80: Flags [S], cksum 0x36d6 (incorrect -> 0x25a2), seq 506285654, win 26733, options [mss 8911,sackOK,TS val 153372198 ecr 0,nop,wscale 9], length 0

Layer 2 (ARP):
15:26:01.221898 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 100.96.54.240 tell 100.96.54.247, length 28
15:26:01.222050 ARP, Ethernet (len 6), IPv4 (len 4), Reply 100.96.54.240 is-at 0a:58:64:60:36:f0, length 28

TCP SYN+ACK:
15:26:01.222151 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    100.96.54.247.80 > 100.96.54.240.43534: Flags [S.], cksum 0x36d6 (incorrect -> 0xc318), seq 2871879716, ack 506285655, win 26697, options [mss 8911,sackOK,TS val 153372198 ecr 153372198,nop,wscale 9], length 0

TCP RESET:
15:26:01.222166 IP (tos 0x0, ttl 64, id 32433, offset 0, flags [DF], proto TCP (6), length 40)
    100.96.54.240.43534 > 100.96.54.247.80: Flags [R], cksum 0x6256 (correct), seq 506285655, win 0, length 0

TCP SYN (2nd time):
15:26:02.220815 IP (tos 0x0, ttl 64, id 2161, offset 0, flags [DF], proto TCP (6), length 60)
    100.96.54.240.43534 > 100.67.30.133.80: Flags [S], cksum 0x1e47 (incorrect -> 0x3d37), seq 506285654, win 26733, options [mss 8911,sackOK,TS val 153372448 ecr 0,nop,wscale 9], length 0
15:26:02.220855 IP (tos 0x0, ttl 63, id 2161, offset 0, flags [DF], proto TCP (6), length 60)
    100.96.54.240.43534 > 100.96.54.247.80: Flags [S], cksum 0x36d6 (incorrect -> 0x24a8), seq 506285654, win 26733, options [mss 8911,sackOK,TS val 153372448 ecr 0,nop,wscale 9], length 0
15:26:02.220897 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    100.96.54.247.80 > 100.96.54.240.43534: Flags [S.], cksum 0x36d6 (incorrect -> 0x91f0), seq 2887489130, ack 506285655, win 26697, options [mss 8911,sackOK,TS val 153372448 ecr 153372448,nop,wscale 9], length 0
15:26:02.220915 IP (tos 0x0, ttl 64, id 32492, offset 0, flags [DF], proto TCP (6), length 40)
    100.96.54.240.43534 > 100.96.54.247.80: Flags [R], cksum 0x6256 (correct), seq 506285655, win 0, length 0
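
The conntrack entry for the service IP can also be followed live while reproducing the curl, using conntrack's event mode (a small diagnostic sketch):

    # print conntrack events for connections whose original destination is the service IP
    conntrack -E -p tcp -d 100.67.30.133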

The relevant iptables rules (automatically managed by kube-proxy) on instance 1 (ip-172-20-121-84, us-east-1c):

-A INPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES

-A KUBE-SERVICES ! -s 100.96.0.0/11 -d 100.67.30.133/32 -p tcp -m comment --comment "prod/export: cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 100.67.30.133/32 -p tcp -m comment --comment "prod/export: cluster IP" -m tcp --dport 80 -j KUBE-SVC-3IL52ANAN3BQ2L74

-A KUBE-SVC-3IL52ANAN3BQ2L74 -m comment --comment "prod/export:" -m statistic --mode random --probability 0.10000000009 -j KUBE-SEP-4XYJJELQ3E7C4ILJ
-A KUBE-SVC-3IL52ANAN3BQ2L74 -m comment --comment "prod/export:" -m statistic --mode random --probability 0.11110999994 -j KUBE-SEP-2ARYYMMMNDJELHE4
-A KUBE-SVC-3IL52ANAN3BQ2L74 -m comment --comment "prod/export:" -m statistic --mode random --probability 0.12500000000 -j KUBE-SEP-OAQPXBQCZ2RBB4R7
-A KUBE-SVC-3IL52ANAN3BQ2L74 -m comment --comment "prod/export:" -m statistic --mode random --probability 0.14286000002 -j KUBE-SEP-SCYIBWIJAXIRXS6R
-A KUBE-SVC-3IL52ANAN3BQ2L74 -m comment --comment "prod/export:" -m statistic --mode random --probability 0.16667000018 -j KUBE-SEP-G4DTLZEMDSEVF3G4
-A KUBE-SVC-3IL52ANAN3BQ2L74 -m comment --comment "prod/export:" -m statistic --mode random --probability 0.20000000019 -j KUBE-SEP-NXPFCT6ZBXHAOXQN
-A KUBE-SVC-3IL52ANAN3BQ2L74 -m comment --comment "prod/export:" -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-7DUMGWOXA5S7CFHJ
-A KUBE-SVC-3IL52ANAN3BQ2L74 -m comment --comment "prod/export:" -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-LNIY4F5PIJA3CQPM
-A KUBE-SVC-3IL52ANAN3BQ2L74 -m comment --comment "prod/export:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-SLBETXT7UIBTZCPK
-A KUBE-SVC-3IL52ANAN3BQ2L74 -m comment --comment "prod/export:" -j KUBE-SEP-FMCOTKNLEICO2V37

-A KUBE-SEP-OAQPXBQCZ2RBB4R7 -s 100.96.54.247/32 -m comment --comment "prod/export:" -j KUBE-MARK-MASQ
-A KUBE-SEP-OAQPXBQCZ2RBB4R7 -p tcp -m comment --comment "prod/export:" -m tcp -j DNAT --to-destination 100.96.54.247:80

-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
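
(The rules above were pulled out of the full rule set with something along these lines; the grep pattern just narrows the output to the chains involved:)

    iptables-save | grep -E 'KUBE-SERVICES|KUBE-SVC-3IL52ANAN3BQ2L74|KUBE-SEP-OAQPXBQCZ2RBB4R7|KUBE-MARK-MASQ|KUBE-POSTROUTING'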

This is the service definition:

root@adsvm010:/yamls# kubectl describe service export
Name:              export
Namespace:         prod
Labels:            <none>
Annotations:       <none>
Selector:          run=export
Type:              ClusterIP
IP:                100.67.30.133
Port:              <unset>  80/TCP
TargetPort:        80/TCP
Endpoints:         100.96.5.44:80,100.96.54.235:80,100.96.54.247:80 + 7 more...
Session Affinity:  None
Events:            <none>
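
A quicker way to confirm that PodB (100.96.54.247) really is among the service endpoints:

    kubectl -n prod get endpoints export -o wide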

If, instead of the service, we use the PodB IP directly (so no packet mangling is needed), the connection works.

If we use the service but the randomly selected destination pod is running on a different instance, the connection tracking mechanism works properly: it mangles the packet back so that PodA sees the SYN+ACK coming from the ServiceB IP, as expected. In this case the traffic goes through the cni0 and flannel.0 interfaces.
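
In other words, the three cases look roughly like this (podA is a placeholder for the actual pod name):

    # 1) directly to PodB, same node: works
    kubectl -n prod exec podA -- curl -s --max-time 5 http://100.96.54.247/

    # 2) through the service, when the chosen endpoint is on another node: works
    # 3) through the service, when the chosen endpoint is on the same node: times out
    kubectl -n prod exec podA -- curl -s --max-time 5 http://100.67.30.133/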

This behavior started a few weeks ago; before that we had not observed any problems for over a year, and we do not recall any major change to the cluster setup or to the pods we are running. Does anybody have any idea that would explain why the SYN+ACK packet is not mangled back to the expected src/dst IPs?


Solution

  • I finally found the answer. The cni0 interface is a bridge with all the pod virtual interfaces attached to it (one veth per pod running on that node):

    root@ip-172-20-121-84:/home/admin# brctl show
    bridge name bridge id       STP enabled interfaces
    cni0        8000.0a5864603601   no      veth05420679
                                            veth078b53a1
                                            veth0a60985d
    ...
    
    
    root@ip-172-20-121-84:/home/admin# ip addr
    5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UP group default qlen 1000
        link/ether 0a:58:64:60:36:01 brd ff:ff:ff:ff:ff:ff
        inet 100.96.54.1/24 scope global cni0
           valid_lft forever preferred_lft forever
        inet6 fe80::1c66:76ff:feb6:2122/64 scope link
           valid_lft forever preferred_lft forever
    

    Traffic that crosses between the bridge and some other interface is processed by netfilter/iptables, but traffic that never leaves the bridge (e.g. from one veth to another, both attached to the same bridge) is NOT processed by netfilter/iptables.
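
    A quick way to check on a node whether bridged traffic is currently being handed to iptables at all:

    # is the br_netfilter module loaded?
    lsmod | grep br_netfilter

    # if it is, this sysctl decides whether bridged IPv4 traffic traverses iptables
    # (the entry only exists once the module is loaded; 1 means enabled)
    sysctl net.bridge.bridge-nf-call-iptables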

    In the example I described in the question, PodA (100.96.54.240) sends a SYN packet to ServiceB (100.67.30.133), which is not in the cni0 subnet (100.96.54.1/24), so the packet does not stay on the bridged cni0 interface and iptables processes it. That is why we see that the DNAT happened and got registered in conntrack. But if the selected destination pod is on the same node, for instance PodB (100.96.54.247), then PodB sees the SYN packet and responds with a SYN+ACK whose source is 100.96.54.247 and destination is 100.96.54.240. These IPs are inside the cni0 subnet and the packet never needs to leave the bridge, so netfilter/iptables does not process it and does not mangle the packet back based on the conntrack information (i.e. the real source 100.96.54.247 is not replaced by the expected source 100.67.30.133).
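
    To see why the two cases behave differently, it also helps to look at the routing from the pod's point of view. The route table inside a pod typically looks something like this (an illustrative sketch; the exact routes depend on the CNI configuration):

    # run inside a pod on this node
    ip route
    # default via 100.96.54.1 dev eth0     <- off-subnet destinations (e.g. the service IP) are sent to the
    #                                          cni0 gateway, enter the host IP stack and traverse iptables
    # 100.96.54.0/24 dev eth0 scope link   <- same-subnet destinations (another pod on this node) are
    #                                          delivered directly over the bridge at layer 2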

    Fortunately, the br_netfilter (bridge-netfilter) kernel module can make netfilter/iptables process the traffic that stays on the bridge:

    root@ip-172-20-121-84:/home/admin# modprobe br_netfilter
    root@ip-172-20-121-84:/home/admin# cat /proc/sys/net/bridge/bridge-nf-call-iptables
    1
    
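    Loading the module by hand only lasts until the next reboot. On a node managed outside of kops, a typical way to make it persistent would be something along these lines (standard systemd/sysctl locations, not Kubernetes-specific):

    # load br_netfilter on every boot
    echo br_netfilter > /etc/modules-load.d/br_netfilter.conf

    # keep bridged IPv4 traffic going through iptables
    echo 'net.bridge.bridge-nf-call-iptables = 1' > /etc/sysctl.d/99-bridge-nf-call-iptables.conf
    sysctl --system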

    To fix this in a Kubernetes cluster set up with kops (credits), edit the cluster manifest with kops edit cluster and, under spec:, include:

    hooks:
    - name: fix-bridge.service
      roles:
      - Node
      - Master
      before:
      - network-pre.target
      - kubelet.service
      manifest: |
        Type=oneshot
        ExecStart=/sbin/modprobe br_netfilter
        [Unit]
        Wants=network-pre.target
        [Install]
        WantedBy=multi-user.target
    

    This creates a systemd service at /lib/systemd/system/fix-bridge.service on your nodes that runs at startup and makes sure the br_netfilter module is loaded before Kubernetes (i.e. kubelet) starts. Without it, what we experienced with the AWS EC2 instances (Debian Jessie images) is that sometimes the module gets loaded during boot and sometimes it does not (I do not know why there is such variability), so depending on that the problem may or may not manifest itself.
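
    For reference, after editing the cluster spec the change still has to be pushed out and the nodes rolled, using the usual kops workflow (assuming the cluster name and state store are already configured in the environment); it can then be verified on any node:

    kops update cluster --yes
    kops rolling-update cluster --yes

    # afterwards, on a node:
    systemctl status fix-bridge.service
    lsmod | grep br_netfilter
    cat /proc/sys/net/bridge/bridge-nf-call-iptables   # should print 1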