Tags: docker, overlay, iptables, docker-swarm, openvpn

Docker swarm overlay network with vxlan routing over openvpn


I have set up a docker swarm with 3 nodes (docker 18.03). These nodes use an overlay network to communicate.

node1:  
  laptop   
  host tun0 172.16.0.6 --> openvpn -> nat gateway
  container n1
    ip = 192.169.1.10  

node2: 
  aws ec2
  host eth2 10.0.30.62
  container n2
    ip = 192.169.1.9

node3:
  aws ec2
  host eth2 10.0.140.122
  container n3
    ip = 192.169.1.12

nat-gateway:
  aws ec2
  tun0 172.16.0.1 --> openvpn --> laptop
  eth0 10.0.30.198

The scheme is partly working:
1. Containers can ping each other by name (n1, n2, n3)
2. Docker swarm commands are working, services can be deployed

However, the overlay only partly works: some nodes cannot communicate with each other over either tcp/ip or udp. I tried all combinations of the 3 nodes with both protocols:

[image: connectivity results for all node combinations over udp and tcp/ip]

I did a tcpdump on the nat gateway to monitor overlay vxlan network activity (port 4789):

tcpdump -l -n -i eth0 "port 4789"
tcpdump -l -n -i tun0 "port 4789"

Then I tried tcp/ip communication from node1 to node3.

On node3:

nc -l -s 0.0.0.0 -p 8999

On node1:

telnet 192.169.1.12 8999

Node1 will then try to connect to node3. I see packets coming in on the nat-gateway over the tun0 interface:

[image: tcpdump output on the nat-gateway tun0 interface]

On the nat-gateway eth0 interface:

[image: tcpdump output on the nat-gateway eth0 interface]

It seems that the nat-gateway is not sending replies back over the tun0 interface.

The iptables configuration of the nat-gateway:

[image: iptables configuration of the nat-gateway]

The routing table of the nat-gateway:

[image: routing table of the nat-gateway]

Can you help me solve this issue?


Solution

  • I have been able to fix the issue using the following configuration on the NAT gateway:

    [image: NAT gateway configuration]

    and

    [image: NAT gateway configuration, continued]

    1. No masquerading of 172.16.0.0/22 is needed. All the workers and managers will route their traffic for 172.16.0.0/22 via the NAT gateway, and it knows how to send the packets over tun0.
    2. Masquerading of eth0 was just wrong...

    All the containers can now ping and establish tcp/ip connections to each other.
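
The two points in the answer can be sketched roughly as shell commands. The screenshots with the exact rules are not reproduced here, so everything below is an assumption built from the text: the function names `gateway_cmds` and `node_cmds` are hypothetical, the FORWARD rules are only needed if the gateway's FORWARD policy is not ACCEPT, and the `ip route` line only works for hosts that share a subnet with the gateway (in an AWS VPC the same route would more likely be set in the VPC route table). The script only prints the commands, so nothing changes until they are reviewed and run by hand as root.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of running them.
# Addresses come from the question; the rule set itself is an assumption.

VPN_NET="172.16.0.0/22"     # OpenVPN subnet named in the answer
GATEWAY_IP="10.0.30.198"    # eth0 address of the NAT gateway

# NAT gateway: forward between eth0 and tun0, with no MASQUERADE rule
# for the VPN subnet and none on eth0 (points 1 and 2 of the answer).
gateway_cmds() {
    echo "sysctl -w net.ipv4.ip_forward=1"
    echo "iptables -A FORWARD -i eth0 -o tun0 -d $VPN_NET -j ACCEPT"
    echo "iptables -A FORWARD -i tun0 -o eth0 -s $VPN_NET -j ACCEPT"
}

# Worker or manager: send traffic for the VPN subnet via the gateway.
node_cmds() {
    echo "ip route add $VPN_NET via $GATEWAY_IP"
}

gateway_cmds
node_cmds
```

Without masquerading, the vxlan packets keep their original source addresses, which is why the other nodes need a route back to 172.16.0.0/22 through the gateway.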