Tags: rabbitmq, amqp, messagebroker, rabbitmq-exchange

RabbitMQ cluster crashing when creating queues


Hello, I have a question that I could not solve after about two days of searching, so I am writing it here as clearly as possible so that it may help others too.

The scenario is:

  1. We have an application that will handle about 200k devices through the AMQP protocol, using a RabbitMQ cluster.
  2. We thought of having 1 exchange with 200k queues for the devices, each queue with around 6 routing keys.
  3. These queues need to be durable and lazy, as we don't want to lose any messages.
  4. We are using queue mirroring across nodes, as we need HA.
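To get a feel for the scale of this design, the numbers in the scenario can simply be multiplied out. The following is a rough sketch, not a measurement: it assumes one queue per device, a replication factor of 2, a 5-node cluster as in the test described below, and an even spread of queue replicas across nodes.

```python
# Back-of-the-envelope sizing for the scenario above.
DEVICES = 200_000          # assumption: one queue per device
BINDINGS_PER_QUEUE = 6     # "around 6" routing keys per queue
REPLICAS = 2               # assumption: each queue is mirrored on 2 nodes
NODES = 5                  # assumption: size of the test cluster


def sizing(queues, bindings_per_queue, replicas, nodes):
    """Return (total bindings, total queue replicas, replicas per node)."""
    total_bindings = queues * bindings_per_queue
    total_replicas = queues * replicas
    # Assumption: replicas are spread evenly across the nodes.
    per_node = total_replicas / nodes
    return total_bindings, total_replicas, per_node


bindings, replicas, per_node = sizing(DEVICES, BINDINGS_PER_QUEUE, REPLICAS, NODES)
print(bindings, replicas, per_node)
```

Under these assumptions the cluster has to hold over a million bindings and several hundred thousand queue replicas, so per-queue overhead matters much more than message throughput here.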

The test:

  1. I created a cluster with 5 nodes and a replication factor of 2, using this policy:
    "definition": {
            "ha-mode": "exactly",
            "ha-params": 2,
            "ha-sync-mode": "automatic",
            "ha-sync-batch-size": 1
          }
  2. I created 50k durable, lazy queues, along with their routing keys:
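# BINDINGS, EXCHANGE and get_channel() are defined elsewhere in the test script (not shown here).

def create_one_queue(queue_name, threadName, channel):
    # Declare a durable, lazy queue, then bind it once per routing-key pattern.
    channel.queue_declare(queue=queue_name, durable=True, arguments={'x-queue-mode': 'lazy'})
    for bind in BINDINGS:
        channel.queue_bind(exchange=EXCHANGE, queue=queue_name, routing_key=bind.format(queue_name))
    print("[{}]Created Queue {}".format(threadName, queue_name))

def create_queues(threadName, base):
    # Each thread creates 1000 queues named base, base + 1, ..., base + 999.
    channel = get_channel()
    for i in range(0, 1000):
        try:
            queue_name = str(i + base)
            create_one_queue(queue_name, threadName, channel)
        except Exception as e:
            # Log the error and carry on with the next queue.
            print(e)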

  3. When I tried to keep growing toward 200k queues, nodes started to crash without running out of resources.
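For reference, a policy like the one in step 1 could be set through the management plugin's HTTP API (`PUT /api/policies/{vhost}/{name}`). The sketch below only builds the request and does not send it. The host, the policy name `ha-two` and the pattern `^[0-9]+$` (chosen to match the numeric queue names used in the test) are hypothetical placeholders, not values from the original setup.

```python
import json
import urllib.parse
import urllib.request

# The definition is copied from step 1; every other value used below
# (host, policy name, pattern) is a placeholder for illustration.
DEFINITION = {
    "ha-mode": "exactly",
    "ha-params": 2,
    "ha-sync-mode": "automatic",
    "ha-sync-batch-size": 1,
}


def build_policy_request(base_url, vhost, name, pattern, definition):
    """Build (but do not send) a PUT request for /api/policies/{vhost}/{name}."""
    url = "{}/api/policies/{}/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(vhost, safe=""),  # the default vhost "/" becomes %2F
        urllib.parse.quote(name, safe=""),
    )
    body = {
        "pattern": pattern,
        "definition": definition,
        "apply-to": "queues",
        "priority": 0,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_policy_request("http://localhost:15672", "/", "ha-two", "^[0-9]+$", DEFINITION)
print(req.get_method(), req.full_url)
```

To actually apply the policy, the request would also need credentials (for example a basic-auth `Authorization` header) before being passed to `urllib.request.urlopen`.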

Links

I have already looked at the following posts:

https://www.rabbitmq.com/ha.html#ways-to-configure

https://www.cloudamqp.com/blog/2018-01-09-part3-rabbitmq-best-practice-for-high-availability.html

RabbitMQ - How many queues RabbitMQ can handle on a single server?

https://serverfault.com/questions/378165/rabbitmq-reasonable-performance-scale-expectations

http://rabbitmq.1065348.n5.nabble.com/How-many-queues-can-one-broker-support-td21539.html

https://www.quora.com/RabbitMQ/Can-rabbitMQ-or-zeroMQ-handle-1mil-queues

but I see contradictions (CloudAMQP suggests using few queues, while other sources say you can reach 1M queues).

Questions

  1. How is it possible that the cluster starts to crash if I am not running out of resources?
  2. Is my approach wrong?
  3. Any advice to improve my cluster configuration?

Thanks a lot


Solution

  • OK, I will answer my own question with my findings so far:

    1) As I was using Kubernetes and Helm to deploy the cluster, I was putting too much memory pressure on the pods, leaving no free space for the garbage collector. See https://www.rabbitmq.com/memory-use.html#queue-memory-usage-gc, which says:

    High memory watermark blocks publishers and prevents new messages from being enqueued. Since garbage collection can double the memory used by a queue, it is unsafe to set the high memory watermark above 0.5. The default high memory watermark is set to 0.4 since this is safer as not all memory is used by queues. This is entirely workload specific, which differs across RabbitMQ deployments.

    2) Seems ok.

    3) In order to create 200k durable, lazy queues, I had to use a cluster of 10 nodes, each with 8 vCPUs and 30 GB of RAM.
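To see why memory headroom matters at this scale, the figures above can be turned into a rough per-queue budget. This is only a sketch under stated assumptions: the default watermark of 0.4 is used, "30 GB" is treated as 30 GiB, each queue is assumed to be mirrored on 2 nodes as in the test, and replicas are assumed to be spread evenly across the 10 nodes.

```python
GIB = 1024 ** 3


def memory_budget(node_ram_gib, watermark, queue_replicas, nodes):
    """Return (watermark in bytes, bytes available per queue replica)."""
    watermark_bytes = node_ram_gib * GIB * watermark
    # Assumption: queue replicas are spread evenly across the nodes.
    replicas_per_node = queue_replicas / nodes
    return watermark_bytes, watermark_bytes / replicas_per_node


# 10 nodes with 30 GiB each, watermark 0.4, 200k queues with 2 replicas each.
limit, per_replica = memory_budget(30, 0.4, 200_000 * 2, 10)
print(round(limit / GIB, 1), "GiB below the watermark per node")
print(round(per_replica / 1024), "KiB per queue replica")
```

The per-replica figure is an upper bound, since not all memory below the watermark is available to queues, and a pod memory limit lower than the node's RAM shrinks it further.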

    Note: I will keep this answer up to date as I tune my cluster.