
RabbitMQ federation queue was deleted when the federation link was restarted after a problem


We have two RabbitMQ clusters. One is used as upstream and the other as downstream. Each cluster has 3 nodes. Messages published to the orders exchange on the upstream cluster are delivered to the downstream cluster using RabbitMQ federation.

This had been working fine for the last 6 months. Suddenly, on 10/4, we received a cluster partition error on the upstream cluster. It was caused by one of the systems in the cluster hanging for more than a minute. We saw recurring entries about this in the system logs and temporarily brought that system down. The upstream cluster is now running as a two-node cluster. We did not notice it at the time, but on 10/8 we realized that the downstream cluster had not received any order messages since 10/4.

Upon further investigation, I found that the federation link on the downstream cluster was still showing as running, but 87000+ messages had accumulated on the auto-created "federation" queue on the upstream cluster. In order to retrieve those messages, I restarted the federation link from the downstream cluster. Unexpectedly, I saw the "federation" queue being deleted and recreated on the portal, taking those 87000+ messages with it into the darkness of space. We started receiving new messages from that time onwards, but the old messages were simply lost.
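A check along the following lines might have flagged the growing backlog days earlier. It is only a minimal sketch, assuming the management plugin's HTTP API is reachable on the upstream cluster; the function names, base URL, credentials and threshold are hypothetical and not part of our setup.

```python
import base64
import json
import urllib.parse
import urllib.request


def federation_backlog(queues, threshold):
    """Return (name, depth) for federation queues deeper than threshold.

    `queues` is the decoded JSON list returned by GET /api/queues/<vhost>.
    Federation queues are recognised by their "federation:" name prefix.
    """
    return [
        (q["name"], q.get("messages", 0))
        for q in queues
        if q["name"].startswith("federation:")
        and q.get("messages", 0) > threshold
    ]


def fetch_queues(base_url, vhost, user, password):
    """Fetch the queue list for one vhost from the management HTTP API."""
    # The vhost must be percent-encoded ("/" becomes "%2F").
    url = f"{base_url}/api/queues/{urllib.parse.quote(vhost, safe='')}"
    request = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `federation_backlog(fetch_queues("http://upstream-host:15672", "production", user, password), 1000)` from a scheduled job, with a hypothetical host and threshold, would list any federation queue holding more than 1000 messages.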

Before putting the solution into place, we ran a proof of concept by shutting down both clusters one at a time. Each time, the federation queue retained the persistent messages, and once both clusters were back in a healthy state, the downstream federation link fetched those messages. We therefore concluded that as long as the "federation" queue is available on the upstream cluster, the federation link on the downstream side will pick up the messages, and so we never anticipated this issue.

We do not set the x-expires or x-message-ttl parameters in the federation configuration, nor does the app set them when publishing messages. The federation configuration contains only "trust-user-id": false, the URI (all 3 cluster nodes) and the exchange name. Everything else is left at its default, which means "x-expires" on the federation queue should be "never" (so the queue should live forever unless the federation link is deleted on the downstream side). Our messages are also published as persistent.
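For reference, the upstream definition described above corresponds roughly to the following management HTTP API request. This is a sketch under assumptions: the upstream name and host names are placeholders, and the helper function is hypothetical. Only the three keys we actually set appear in the value.

```python
import json
import urllib.parse


def upstream_request(vhost, name, uris, exchange):
    """Build the path and JSON body for a PUT that defines a federation upstream."""
    path = "/api/parameters/federation-upstream/{}/{}".format(
        urllib.parse.quote(vhost, safe=""),
        urllib.parse.quote(name, safe=""),
    )
    body = {
        "value": {
            "uri": uris,              # one URI per upstream cluster node
            "exchange": exchange,     # name of the exchange on the upstream
            "trust-user-id": False,
            # "expires" and "message-ttl" are deliberately left out, so the
            # upstream queue and its messages should never expire.
        }
    }
    return path, json.dumps(body)


path, body = upstream_request(
    "production",
    "upstream-cluster",  # placeholder upstream name
    ["amqp://upstream-1", "amqp://upstream-2", "amqp://upstream-3"],  # placeholder hosts
    "order.exch",
)
print(path)
print(body)
```

Printing the body is a quick way to confirm that no expiry-related key is being sent.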

Only the logs on the upstream system have any relevant information about this problem from the time of the federation link restart. The snippet is included below. It shows the queue being initialized with a depth of 0.

I have the following questions:

  1. Is our understanding of federation correct (in the context of what is described above)? We have no way to reproduce the problem, but does anyone know the cause, or a setting we are missing on our end?
  2. With each federation link restart on the downstream side, does the "federation" queue always get recreated on the upstream side?
  3. Is there a command to see the creation timestamp of objects such as queues and exchanges?
  4. What best practices or techniques can we follow to ensure that the federation queue is not deleted?

RabbitMQ versions:

  • Upstream: RabbitMQ 3.6.1, Erlang R16B03-1
  • Downstream: RabbitMQ 3.6.15, Erlang 20.3.4

Log snippet from the upstream RabbitMQ node. No other relevant log entries were found.

++++++++++++++++++++++++++++++++++++++++++++++++++++++

=WARNING REPORT==== 8-Oct-2018::14:57:38 ===

closing AMQP connection <0.1688.0> (:51364 -> :5672):

client unexpectedly closed TCP connection

=INFO REPORT==== 8-Oct-2018::14:58:07 ===

accepting AMQP connection <0.521.123> (:46659 -> :5672)

=INFO REPORT==== 8-Oct-2018::14:58:08 ===

Mirrored queue 'federation: order.exch -> ' in vhost 'production': Adding mirror on node 'rabbit@upstream-hostname': <7719.25968.3282>

=INFO REPORT==== 8-Oct-2018::14:58:08 ===

Mirrored queue 'federation: order.exch -> ' in vhost 'production': Synchronising: 0 messages to synchronise

=INFO REPORT==== 8-Oct-2018::14:58:08 ===

Mirrored queue 'federation: order.exch -> ' in vhost 'production': Synchronising: batch size: 4096

=INFO REPORT==== 8-Oct-2018::14:58:08 ===

Mirrored queue 'federation: order.exch -> ' in vhost 'production': Synchronising: all slaves already synced

=INFO REPORT==== 8-Oct-2018::14:58:09 ===

accepting AMQP connection <0.567.123> (:46659 -> :5672)

++++++++++++++++++++++++++++++++++++++++++++++++++++

Please let me know if you need more information from my end to answer these questions.


Solution

  • Thanks to all who looked at this question. I am answering it myself as it might help someone else in the future.

    We opened a case with the vendor. They tried to replicate this in their labs but were not able to. At this time we are not sure what caused the problem. To see when a queue was created, another plugin, rabbitmq_event_exchange, needs to be enabled and its events consumed.

    RabbitMQ looks like a product that is excellent for smaller apps and microservices, but not as good when it comes to distributed messaging and clustering. We have seen messages get lost due to clustering problems in the past as well. That's just my opinion based on my experience so far; I have been wrong before.
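For anyone who wants to try the event exchange approach mentioned above, here is a minimal sketch of a listener. It assumes the plugin is enabled with `rabbitmq-plugins enable rabbitmq_event_exchange`, that events arrive on the `amq.rabbitmq.event` topic exchange with routing keys such as `queue.created` and `queue.deleted`, and that the third-party `pika` client (1.x) is installed. The header keys `name` and `vhost` are my assumption about what the events carry, and the function names are hypothetical. We have not run this against our production clusters.

```python
from datetime import datetime, timezone

EVENT_EXCHANGE = "amq.rabbitmq.event"


def describe_event(routing_key, headers, timestamp):
    """Format one event as a log line; timestamp is seconds since the epoch."""
    if timestamp is None:
        when = "unknown-time"
    else:
        when = datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()
    headers = headers or {}
    # "name" and "vhost" are assumed header keys, read defensively.
    return "{} {} name={} vhost={}".format(
        when, routing_key, headers.get("name"), headers.get("vhost")
    )


def watch_queue_events(host="localhost"):
    """Print queue creation and deletion events until interrupted."""
    import pika  # third-party client, imported here so the helper above has no dependency

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    # Server-named exclusive queue that disappears when this listener exits.
    queue_name = channel.queue_declare(queue="", exclusive=True).method.queue
    for key in ("queue.created", "queue.deleted"):
        channel.queue_bind(exchange=EVENT_EXCHANGE, queue=queue_name, routing_key=key)

    def on_event(ch, method, properties, body):
        # Event bodies are empty; the details are in the headers and properties.
        print(describe_event(method.routing_key, properties.headers, properties.timestamp))

    channel.basic_consume(queue=queue_name, on_message_callback=on_event, auto_ack=True)
    channel.start_consuming()
```

Running `watch_queue_events("upstream-host")` (hypothetical host name) before restarting a federation link should show whether the federation queue is deleted and recreated, and when. It only reports events that occur while it is listening, so it cannot recover the creation time of a queue that already exists.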