Tags: postgresql, docker, wildfly, docker-swarm

docker swarm - connections from wildfly to postgres randomly hang


I'm experiencing a weird problem when deploying a docker stack (compose file).

I have a three node docker swarm - master and two workers. All machines are CentOS 7.5 with kernel 3.10.0 and docker 18.03.1-ce.

Most things run on the master, one of which is a wildfly (v9.x) application server. On one of the workers is a postgres database. After deploying the stack things work normally, but after a while (or maybe after a specific action in the web app) requests start to hang. Running netstat -ntp inside the wildfly container shows 52 bytes stuck in the Send-Q:

tcp        0     52 10.0.0.72:59338         10.0.0.37:5432          ESTABLISHED -
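
For reference, if the wildfly image doesn't ship netstat, one way to get the same view is to run it via docker exec or from the host inside the container's network namespace (a sketch; the container name wildfly_container is a placeholder):

docker exec -it wildfly_container netstat -ntp         # if the image has netstat
PID=$(docker inspect -f '{{.State.Pid}}' wildfly_container)
nsenter -t "$PID" -n netstat -ntp                      # or: nsenter -t "$PID" -n ss -ntp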

On the postgres side the connection is also in ESTABLISHED state, but the send and receive queues are 0. It's always exactly 52 bytes. I read somewhere that ACK packets with timestamps are also 52 bytes. Is there any way I can verify that? We have the following sysctl tunables set:

net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_timestamps = 0

The first three were needed because of this.
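
As for verifying the 52-byte theory: one option is to capture the postgres traffic from inside the wildfly container's network namespace and look at the packet sizes and TCP options (a sketch; the interface name eth0 and the nsenter/PID setup shown above are assumptions, and tcpdump here is the host's binary):

nsenter -t "$PID" -n tcpdump -nn -v -i eth0 'tcp port 5432 and host 10.0.0.37'

With -v tcpdump prints the IP total length of each packet, and the flags line shows whether the TCP timestamp option (TS val ...) is present, so you can check whether the 52-byte segments are timestamped ACKs.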

All services in the stack are connected to the same default network that docker creates. If I move the postgres service onto the same host as the wildfly service, the problem doesn't seem to surface. Likewise, if I declare a separate network for postgres and attach it only to the database and to the services that need it, the problem also doesn't seem to show up.
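
For reference, the separate-network workaround looks roughly like this in the compose file (a sketch; the service names, images and the network name dbnet are made up):

version: "3.4"
services:
  wildfly:
    image: my-wildfly-app:latest      # hypothetical application image
    networks:
      - default                       # still reachable by the rest of the stack
      - dbnet
  postgres:
    image: postgres:10
    networks:
      - dbnet                         # postgres only joins the database network
networks:
  default:
    driver: overlay
  dbnet:
    driver: overlay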

Has anyone come across a similar issue? Can anyone provide any pointers on how I can debug the problem further?


Solution

  • It turns out this is a known issue with pooled connections in swarm when the services involved run on different nodes.

    Basically the workaround is to set the above tunables and to enable TCP keepalive on the socket. See here and here for more details.
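
    For the keepalive part, one way to do it with the PostgreSQL JDBC driver is to set its tcpKeepAlive connection property on the WildFly datasource, so the pooled connections enable SO_KEEPALIVE on their sockets (a sketch; the JNDI name, database name and credentials are made up):

    <datasource jndi-name="java:jboss/datasources/AppDS" pool-name="AppDS">
        <connection-url>jdbc:postgresql://postgres:5432/appdb</connection-url>
        <!-- ask the pgjdbc driver to enable SO_KEEPALIVE on its sockets -->
        <connection-property name="tcpKeepAlive">true</connection-property>
        <driver>postgresql</driver>
        <security>
            <user-name>app</user-name>
            <password>secret</password>
        </security>
    </datasource>

    The same thing can be expressed in the URL, e.g. jdbc:postgresql://postgres:5432/appdb?tcpKeepAlive=true. The keepalive probes then follow the net.ipv4.tcp_keepalive_* values set above.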