DC/OS virtual network doesn't work across agents

I have created host and bridge mode Marathon apps without issue, and used L4LB and marathon-lb to expose them. All of that works.

I'm now trying to use USER mode networking with the default "dcos" 9.0.0.0/8 network. In this mode my apps can only talk to other containers on the same agent, and each host OS can only reach containers running on that host. It appears that agents can't route traffic to each other over the virtual network.

For testing I'm using the Docker "nginx:alpine" image with 2 instances on different hosts. Their IPs are 9.0.6.130 and 9.0.3.130. There is no L4LB or Marathon-LB config, no service endpoints, and no ports exposed on the host network. The relevant part of the app definition is:

{
  "container": {
    "docker": {
      "image": "nginx:alpine",
      "forcePullImage": false,
      "privileged": false,
      "network": "USER"
    }
  },
  "labels": null,
  "ipAddress": {
    "networkName": "dcos"
  }
}

In a shell in one of them:

/ # ip addr list | grep 'inet 9'
inet 9.0.6.130/25 scope global eth0

/ # nc -vz 9.0.6.130:80
9.0.6.130:80 (9.0.6.130:80) open

/ # nc -vz 9.0.3.130:80
nc: 9.0.3.130:80 (9.0.3.130:80): Operation timed out

/ # traceroute 9.0.3.130
traceroute to 9.0.3.130 (9.0.3.130), 30 hops max, 46 byte packets
 1  9.0.6.129 (9.0.6.129)  0.006 ms  0.002 ms  0.001 ms
 2  44.128.0.4 (44.128.0.4)  0.287 ms  0.272 ms  0.100 ms
 3  *  *  *
 4  *  *  *

From the other side:

/ # ip addr list | grep 'inet 9'
inet 9.0.3.130/25 scope global eth0
/ # nc -vz 9.0.3.130:80
9.0.3.130:80 (9.0.3.130:80) open
/ # nc -vz 9.0.6.130:80
/ # traceroute 9.0.6.130
traceroute to 9.0.6.130 (9.0.6.130), 30 hops max, 46 byte packets
 1  9.0.3.129 (9.0.3.129)  0.005 ms  0.003 ms  0.001 ms
 2  44.128.0.7 (44.128.0.7)  0.299 ms  0.241 ms  0.098 ms
 3  *  *  *
 4  *  *  *

Interestingly, from within one of the containers I can ping what I think should be the next (virtual) hop, and all intermediate hops, even though traceroute gets no reply from them. The only address that doesn't answer ping is the destination container's IP on the virtual network.

64 bytes from 44.128.0.7: seq=0 ttl=63 time=0.269 ms
64 bytes from 44.128.0.4: seq=0 ttl=64 time=0.094 ms
64 bytes from 9.0.3.129: seq=0 ttl=64 time=0.072 ms
64 bytes from 9.0.6.129: seq=0 ttl=63 time=0.399 ms
PING 9.0.6.130 (9.0.6.130): 56 data bytes (no response)

Any ideas?

Solution

Figured this out with help from the DC/OS community mailing list.

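RHEL 7 installs firewalld by default, and DC/OS requires it to be disabled. I had disabled it, but that still leaves the default policy of the iptables FORWARD chain set to DROP until the node is rebooted. DC/OS's firewall manipulation only adds and changes rules; it doesn't change the default policy, so traffic forwarded between agents on the virtual network gets dropped.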
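A quick way to tell whether an agent is in this state is to look at the policy line of iptables -S FORWARD, which is printed as "-P FORWARD <policy>". A minimal POSIX shell sketch (the function name forward_policy is hypothetical) that extracts it:

```shell
#!/bin/sh
# Print the default policy of the FORWARD chain.
# Expects `iptables -S FORWARD` output on stdin; for a built-in chain the
# policy is reported on a line of the form "-P FORWARD DROP".
forward_policy() {
  awk '$1 == "-P" && $2 == "FORWARD" { print $3; exit }'
}
```

Run it as root on each agent with iptables -S FORWARD | forward_policy. DROP suggests the node is hitting this problem; ACCEPT means the default policy isn't the culprit.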

Running this as root on each affected agent fixes it:

iptables -P FORWARD ACCEPT

The policy set with iptables -P isn't saved, but ACCEPT is the default on boot anyway unless something (like firewalld) sets it otherwise. With firewalld disabled, it should stay ACCEPT across reboots without any further action.
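Since the fix only persists if nothing reapplies DROP at boot, it may be worth confirming on each agent that firewalld is not enabled. A minimal POSIX shell sketch; the function name firewalld_boot_state and the SYSTEMCTL override variable are hypothetical, and it assumes systemctl is-enabled prints the unit state (for example enabled, disabled or masked) and prints either nothing or "not-found" on stdout when the unit doesn't exist, depending on the systemd version.

```shell
#!/bin/sh
# Report whether firewalld is set to start at boot.
# SYSTEMCTL is a hypothetical override hook; it defaults to the real systemctl.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

firewalld_boot_state() {
  # `systemctl is-enabled` prints the unit's enablement state on stdout.
  state=$($SYSTEMCTL is-enabled firewalld 2>/dev/null)
  case "$state" in
    disabled|masked) echo "ok: firewalld is $state" ;;
    ""|not-found)    echo "ok: firewalld not installed" ;;
    *)               echo "warning: firewalld is $state; it may set the FORWARD policy back to DROP at boot" ;;
  esac
}
```

Anything other than an "ok" line means firewalld may start again on the next boot and the FORWARD policy should be rechecked afterwards.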