RabbitMQ cluster fails when one node is not reachable

I created a RabbitMQ cluster via Docker and Docker Cloud. I am running two RabbitMQ containers on two separate nodes (both hosted on AWS).

The output of rabbitmqctl cluster_status is:

Cluster status of node 'rabbit@rabbitmq-cluster-2' ...
[{nodes,[{disc,['rabbit@rabbitmq-cluster-1','rabbit@rabbitmq-cluster-2']}]},
 {running_nodes,['rabbit@rabbitmq-cluster-1','rabbit@rabbitmq-cluster-2']},
 {cluster_name,<<"rabbit@rabbitmq-cluster-1">>},
 {partitions,[]}]

However, when I stop one container/node, my messages can no longer be delivered and end up queued in .dlx.

I am using senecajs with NodeJS.

Has anybody had the same problem and can point me in the right direction?

Solution

To answer my own question:

The problem was that Docker, after starting, caches the DNS resolution and does not pick up a new one. So if one node of the cluster fails, Docker still tries to connect to that same node instead of trying another one.

The solution was to write my own function for connecting to RabbitMQ. I first check with net.createConnection whether the host is online. If it is, I connect to it; if not, I try a different one.

Every time a RabbitMQ node goes down, my service fails, restarts, and calls the "try this host" function.
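
For illustration, the host check described above could look roughly like the sketch below. It is not my exact code: the function names isReachable and pickReachableHost, the 2-second timeout, and the idea of returning null when no host answers are all assumptions made for this example. It only uses net.createConnection to probe each host over TCP and returns the first one that accepts a connection.

```javascript
const net = require('net');

// Resolve to true if a TCP connection to host:port succeeds within timeoutMs,
// and to false on a connection error, DNS failure, or timeout.
// (Hypothetical helper; the name and default timeout are assumptions.)
function isReachable(host, port, timeoutMs = 2000) {
  return new Promise((resolve) => {
    const socket = net.createConnection({ host, port });
    const finish = (ok) => {
      socket.destroy(); // the probe socket is not reused
      resolve(ok);
    };
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.once('error', () => finish(false));
  });
}

// Try the hosts in order and return the first reachable one,
// or null if none of them accepts a connection.
async function pickReachableHost(hosts, port, timeoutMs = 2000) {
  for (const host of hosts) {
    if (await isReachable(host, port, timeoutMs)) {
      return host;
    }
  }
  return null;
}
```

With the cluster above, this would be called as pickReachableHost(['rabbitmq-cluster-1', 'rabbitmq-cluster-2'], 5672), assuming the containers are reachable under those hostnames and listen on the default AMQP port 5672. The returned host is then handed to the RabbitMQ client; if the result is null, the service can exit and let the restart trigger another attempt.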