Tags: docker, rabbitmq, rabbitmqctl

RabbitMQ Unable to Join Cluster


I am trying to learn how to cluster RabbitMQ nodes, and I am following this tutorial as well as the official documentation.

I have two physical machines, each running RabbitMQ in a Docker container. machine1 (192.168.1.2) is to be the first node of the cluster, and machine2 (192.168.1.3) is to join it.

When I run rabbitmqctl join_cluster [email protected] from machine2, it fails with the following message:

Clustering node [email protected] with [email protected]
Error: unable to perform an operation on node '[email protected]'. Please see diagnostics information and suggestions below.

Most common reasons for this are:

 * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
 * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
 * Target node is not running

In addition to the diagnostics info below:

 * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
 * Consult server logs on node [email protected]
 * If target node is configured to use long node names, don't forget to use --longnames with CLI tools

DIAGNOSTICS
===========

attempted to contact: ['[email protected]']

[email protected]:
  * connected to epmd (port 4369) on 192.168.1.2
  * epmd reports node 'rabbit' uses port 25672 for inter-node and CLI tool traffic
  * TCP connection succeeded but Erlang distribution failed
  * suggestion: check if the Erlang cookie identical for all server nodes and CLI tools
  * suggestion: check if all server nodes and CLI tools use consistent hostnames when addressing each other
  * suggestion: check if inter-node connections may be configured to use TLS. If so, all nodes and CLI tools must do that
   * suggestion: see the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more


Current node details:
 * node name: '[email protected]'
 * effective user's home directory: /var/lib/rabbitmq
 * Erlang cookie hash: XXXXXXXXXXXXX

The error logs on machine1 show nothing related to such a connection attempt. I have compared the md5sum of the cookie files in both Docker containers, and they are identical. So are the file permissions.

I suspected that port 4369 might not be reachable, but it is.

I am unsure what I am doing wrong. Can someone help here?

Additional information:

I am using the rabbitmq:3.8.5-management image. It uses Erlang/OTP 23 [erts-11.0.3].

I have been going through the troubleshooting guide, but I cannot tell what is wrong here. Please let me know if I can provide more information.


Solution

  • Thanks to @NeoAnderson and @José M, I was able to understand what happened.

    The containers running RabbitMQ must be reachable across the network by the hostname that Erlang uses in the node name. Because a container's hostname could not be resolved from a container on the other machine, clustering failed.

    A simple fix is to edit the /etc/hosts file in the containers so that the hostname of the "leader" node maps to that node's IP address.

    I ran RabbitMQ in Docker only to avoid installing it directly, not because I thought this was the best way to set up a cluster. Alternatively, Docker Swarm or Kubernetes would have provided the right networking for me.

    But the root cause was definitely the nodename problem.
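
The /etc/hosts fix described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the IP address comes from the question, while the node name rabbit@rabbit-1 and the helper functions node_host, resolves, and hosts_entry are hypothetical. The real host part must be whatever hostname the leader container actually uses in its node name.

```shell
#!/bin/sh
# Assumed values: the IP comes from the question; the node name
# rabbit@rabbit-1 is hypothetical and must match the leader's real node name.
LEADER_NODE="${LEADER_NODE:-rabbit@rabbit-1}"
LEADER_IP="${LEADER_IP:-192.168.1.2}"

# node_host NODE: print the host part of an Erlang node name (after the "@").
node_host() {
    printf '%s\n' "${1#*@}"
}

# resolves NAME: succeed if NAME resolves through /etc/hosts or DNS.
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# hosts_entry IP NAME: print one /etc/hosts line mapping NAME to IP.
hosts_entry() {
    printf '%s\t%s\n' "$1" "$2"
}

leader_host=$(node_host "$LEADER_NODE")

if resolves "$leader_host"; then
    echo "$leader_host already resolves from this container"
else
    echo "$leader_host does not resolve; a line like this is needed in /etc/hosts:"
    hosts_entry "$LEADER_IP" "$leader_host"
fi

# Docker can also write the entry when the container starts, for example on
# machine2 (hostnames are hypothetical, <image> is a placeholder):
#   docker run -d --hostname rabbit-2 --add-host rabbit-1:192.168.1.2 \
#       -p 4369:4369 -p 25672:25672 -p 5672:5672 <image>
# The join then addresses the leader by hostname:
#   rabbitmqctl stop_app
#   rabbitmqctl join_cluster rabbit@rabbit-1
#   rabbitmqctl start_app
```

Run inside the joining container, the script only reports whether the leader's hostname resolves and prints the line that would be needed. It does not modify /etc/hosts itself. The leader container likely needs the reverse mapping for the joining node as well, since cluster nodes address each other by hostname.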