Tags: ruby-on-rails, parallel-processing, producer-consumer, karafka

Scaling Karafka consumer instances for high-traffic Kafka topics


I'm using Rails with the Karafka gem. One of the topics will receive very heavy traffic, so I added 10 partitions to it and set Karafka::App.config.concurrency to 10, but this isn't making a big difference in processing time. To scale our consumer processing further, I'm considering adding separate consumer instances in our staging environment.

Can I add separate consumer instances in Karafka to further reduce the processing time?

Karafka gem version: 2.1.11, Rails version: 7.0.5

The karafka.rb file looks something like:

class KarafkaApp < Karafka::App
  kafka_config = Settings.kafka
  max_payload_size = 7_000_000 # Setting max payload size as 7MB

  t1_topic = kafka_config['topics']['t1_topic']
  t2_topic = kafka_config['topics']['t2_topic']

  setup do |config|
    config.kafka = {
      'bootstrap.servers': ENV['KAFKA_BROKERS_SCRAM'],
      'max.poll.interval.ms': 1200000
    }
    config.client_id = kafka_config['client_id']
    config.concurrency = 10
    # Recreate consumers with each batch. This will allow Rails code reload to work in the
    # development mode. Otherwise Karafka process would not be aware of code changes
    config.consumer_persistence = !Rails.env.development?

    config.producer = ::WaterDrop::Producer.new do |producer_config|
      # Use all the settings already defined for consumer by default
      producer_config.kafka = ::Karafka::Setup::AttributesMap.producer(config.kafka.dup)

      # Alter things you want to alter
      producer_config.max_payload_size = max_payload_size
      producer_config.kafka[:'message.max.bytes'] = max_payload_size
    end
  end

  routes.draw do
    topic t1_topic.to_sym do
      consumer C1Consumer
    end

    topic t2_topic.to_sym do
      config(
        partitions: Karafka::App.config.concurrency
      )

      consumer C2Consumer
    end
  end
end

Solution

  • Karafka author here.

    Your description is missing some critical information, especially about the nature of your data and processing. Without that it is hard to give specific advice, but I'll do my best.

    First, be aware that for heavily IO-bound work, the setup you describe can at best give roughly a tenfold increase per process, without fully utilizing the CPU. For instance, if processing a single message takes one second, the best case with 10 worker threads is about 10 messages per second, even when everything works as expected. There are ways to push this further, which I explain below.
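
    As a quick back-of-the-envelope illustration of that ceiling (the numbers below are hypothetical, purely to show the arithmetic):

    # Rough throughput ceiling for IO-bound consumption. Hypothetical figures:
    # ~1 second of IO wait per message, 10 worker threads.
    io_seconds_per_message = 1.0
    worker_threads         = 10 # config.concurrency

    # Best case: every worker thread is busy with data from a different partition.
    ceiling = worker_threads / io_seconds_per_message
    puts "best case: ~#{ceiling.round} messages per second" # => ~10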

    It is also crucial to understand how Karafka parallelizes work and what ordering guarantees it provides: data is polled from Kafka and then distributed to worker threads, and both stages need to stay aligned so that parallel processing capacity is not reduced.
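
    To make the ordering guarantee concrete, here is a minimal consumer sketch (OrderedConsumer and handle are illustrative names, not part of your setup): within a single partition, messages reach #consume in offset order, while different partitions can be processed in parallel by different worker threads.

    class OrderedConsumer < Karafka::BaseConsumer
      def consume
        # Each batch delivered here contains messages from a single topic
        # partition, in offset order. Parallelism happens across partitions,
        # i.e. across consumer instances running in separate worker threads.
        messages.each do |message|
          handle(message.payload)
        end
      end

      private

      # Hypothetical stand-in for the real (IO-bound) business logic.
      def handle(payload)
        # ...
      end
    end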

    Karafka can handle up to 100k messages per second under optimal conditions.

    Another point is how Karafka manages multiple partitions within a single process. By default, it keeps the number of Kafka connections low by round-robining data from all assigned partitions into a single internal queue. Excessive prebuffering per partition can then hinder parallel processing, because a single batch taken from that queue may not contain data from enough partitions to spread across the worker threads. This misunderstanding often arises from comparing Karafka with Sidekiq. However, Karafka offers several mitigation strategies; see the sketch after the next paragraph.

    Regarding adding 10 partitions and setting Karafka::App.config.concurrency to 10 without seeing a significant improvement: this is expected if prebuffering is extensive. In such cases Karafka cannot utilize all threads efficiently, especially while catching up on a backlog. The picture may change once the lag is smaller, with data distributed more evenly and parallelism increasing.
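
    One way to mitigate heavy prebuffering (a configuration sketch under my own assumptions, to be tuned for your workload, not a prescription) is to shrink the batches handed to consumers and limit how much librdkafka prefetches per partition, so each batch is more likely to span several partitions. These settings would go into the existing setup block in karafka.rb:

    setup do |config|
      # Smaller batches mean the internal queue is drained more often and work
      # from more partitions reaches the worker threads sooner.
      config.max_messages  = 100   # messages per batch handed to a consumer
      config.max_wait_time = 500   # ms to wait while building a batch

      config.kafka = {
        'bootstrap.servers': ENV['KAFKA_BROKERS_SCRAM'],
        'max.poll.interval.ms': 1200000,
        # librdkafka limit on how much is prefetched per partition per fetch.
        'fetch.message.max.bytes': 1_048_576
      }
    end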

    Adding separate consumer instances can indeed reduce processing time. Additional karafka server processes started with the same group settings join the same consumer group and split the topic's partitions between them, so with 10 partitions you can spread the work across up to 10 processes. Beyond that, Karafka supports several ways to increase parallelism and performance depending on the nature of your work, your topic structure, and the number of topics; one of them, separate subscription groups, is sketched after the link below.

    Karafka's concurrency model is thoroughly explained here: https://karafka.io/docs/Concurrency-and-Multithreading/.
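
    As an illustration of one such option (a sketch based on the public routing API; the subscription group names are mine, not from your setup): placing each topic in its own subscription group gives it a dedicated Kafka connection and polling loop, so heavy traffic on one topic is buffered and distributed to workers independently of the other. This would replace the routes.draw block in karafka.rb:

    routes.draw do
      # Each subscription group maintains its own Kafka connection, so the two
      # topics are polled, buffered and distributed to workers independently.
      subscription_group :t1_group do
        topic t1_topic.to_sym do
          consumer C1Consumer
        end
      end

      subscription_group :t2_group do
        topic t2_topic.to_sym do
          consumer C2Consumer
        end
      end
    end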

    This overview should give you a clearer path to enhancing your Karafka setup.

    P.S. We have a Slack where you can get real-time answers from me and other heavy Karafka users. People run Karafka at the scale of hundreds of thousands of messages per second, and it works well when properly configured.

    P.S. 2: If you plan to bet your business on Karafka at some point, it may be worth investing in the Pro offering, which includes application-specific support, architectural consultation, and a lot of useful features.