Tags: ruby, apache-kafka, ruby-kafka, karafka

Long delays between processing of two consecutive kafka batches (using ruby/karafka consumer)


I am using Karafka to read from a topic and call an external service. Each call to the external service takes roughly 300 ms, so with 3 consumers (3 pods in k8s) running in the consumer group, I expect to achieve about 10 events per second (3 × ~3.3 events/s). The log lines below also confirm the ~300 ms processing time per individual event. However, the overall throughput doesn't add up: each Karafka process seems stuck for a long time between processing two consecutive batches of events.

The following instrumentation around the consume method implies that the consumer code itself is not where the time is going:

https://github.com/karafka/karafka/blob/master/lib/karafka/backends/inline.rb#L12

INFO Inline processing of topic production.events with 8 messages took 2571 ms

INFO 8 messages on production.events topic delegated to xyz
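
For context, the consumer is essentially a loop over the batch that calls the external service once per message. The following is a hypothetical sketch of such a consumer with a manual timing wrapper; the class name, call_external_service, and the Karafka 1.x params_batch API are assumptions based on the description above, not the actual code:

    # Hypothetical sketch of the consumer described above (not the actual code).
    # Assumes Karafka 1.x (ruby-kafka based), where the batch is exposed as params_batch.
    class EventsConsumer < Karafka::BaseConsumer
      def consume
        started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

        params_batch.each do |params|
          call_external_service(params) # ~300 ms per call, per the description above
        end

        elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
        Karafka.logger.info("consume of #{params_batch.size} messages took #{elapsed_ms} ms")
      end

      private

      # Placeholder for the external HTTP/RPC call mentioned above.
      def call_external_service(params)
        # ...
      end
    end

Timings from a wrapper like this line up with Karafka's own "Inline processing ... took N ms" lines, which suggests the stall happens outside consume.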

However, I notice two things:

  1. When I tail the logs on the 3 pods, only one of the 3 pods seems to emit logs at a time. This does not make sense to me, as all partitions have enough events and each consumer should be able to consume in parallel.

  2. Though the message above works out to roughly 321 ms per event (2571/8), in reality the logs stall for a long duration between the processing of two batches. I am curious where that time is going.

====== Edit:

There is some skew in the distribution of data across brokers, as we recently expanded from 3 brokers to a total of 6. However, none of the brokers is under CPU or disk pressure. This is a new cluster, and barely 4-5% CPU is used at peak times.

Our data is evenly distributed across the 3 partitions; I say this because the last offset is roughly the same for each partition.

Partition  First Offset  Last Offset  Size     Leader Node  Replica Nodes  In-sync Replica Nodes  Offline Replica Nodes  Preferred Leader  Under-replicated
0          2174152       3567554      1393402  5            5,4,3          3,4,5                   -                      Yes               No
1          2172222       3566886      1394664  4            4,5,6          4,5,6                   -                      Yes               No
2          2172110       3564992      1392882  1            1,6,4          1,4,6                   -                      Yes               No

However, I do see that one consumer perpetually lags behind the other two. The following table shows the lag for my consumers (there is one consumer process per partition):

Partition     First Offset  Last Offset  Consumer Offset  Lag
0             2174152       3566320      2676120          890200
1             2172222       3565605      3124649          440956
2             2172110       3563762      3185587          378175
Combined lag                                              1709331

Here is a screenshot of the logs from all 3 consumers. You can notice the big difference between the time spent in each invocation of the consume function and the interval between two adjacent invocations. Basically, I want to explain and/or reduce that waiting time. There are 100k+ events in this topic, and my dummy Karafka applications are able to retrieve them quickly, so the Kafka brokers are not the issue.

[Screenshot: karafka consumer logs]

Update after setting max_wait_time to 1 second (previously 5 seconds)

It seems the issue is resolved after reducing the wait config. Now the difference between two consecutive log lines is roughly equal to the time spent in consume:

2021-06-24 13:43:23.425 Inline processing of topic x  with 7 messages took 2047 ms
2021-06-24 13:43:27.787 Inline processing of topic x with 11 messages took 3347 ms
2021-06-24 13:43:31.144 Inline processing of topic x with 11 messages took 3344 ms
2021-06-24 13:43:34.207 Inline processing of topic x with 10 messages took 3049 ms
2021-06-24 13:43:37.606 Inline processing of topic x with 11 messages took 3388 ms
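
For reference, this is roughly what the change looks like in the app's setup block. It is a hedged sketch assuming a Karafka 1.x (ruby-kafka based) configuration in which kafka.max_wait_time is expressed in seconds; the broker list and client id are placeholders, and the exact option name and unit may differ in other versions:

    # Hypothetical Karafka 1.x setup showing where the wait setting was reduced.
    class KarafkaApp < Karafka::App
      setup do |config|
        config.client_id = 'events-consumer'                  # placeholder
        config.kafka.seed_brokers = %w[kafka://broker1:9092]  # placeholder
        # Was 5 (seconds) before; the long wait showed up as stalls between batches.
        config.kafka.max_wait_time = 1
      end
    end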

Solution

  • There are a couple of problems you may be facing. It is a bit of guessing on my side without more details, but let's give it a shot.

    From the Kafka perspective

    Are you sure you're distributing data evenly across partitions? Maybe it is eating things up from just one partition?

    What you wrote here:

    INFO Inline processing of topic production.events with 8 messages took 2571 ms

    This shows that a batch of 8 messages was processed altogether by a single consumer, which could indicate that the data is not distributed evenly.

    From the performance perspective

    There are two performance properties that can affect your understanding of how Karafka operates: throughput and latency.

    • Throughput is the number of messages that can be processed in a given time.
    • Latency is the time it takes a message to get from the moment it was produced to the moment it has been processed.

    As far as I understand, all messages are being produced. You could try playing with the Karafka settings, in particular this one: https://github.com/karafka/karafka/blob/83a9a5ba417317495556c3ebb4b53f1308c80fe0/lib/karafka/setup/config.rb#L114
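
    To make those two properties concrete with the numbers from the question, here is a rough back-of-the-envelope sketch (illustrative only; it assumes the client really does wait the full max_wait_time before each fetch, which is what the asker's update above suggests was happening):

        # Illustrative arithmetic using figures from the question (not measured here).
        per_event_s    = 0.3                              # one external call per message
        consumers      = 3
        max_throughput = consumers * (1.0 / per_event_s)  # ≈ 10 messages per second

        batch_size = 8
        max_wait_s = 5                                     # the old max_wait_time
        # Worst case for the last message in a batch: wait for the fetch, then wait
        # for the whole batch ahead of it to be processed.
        worst_latency = max_wait_s + (batch_size * per_event_s)  # ≈ 7.4 seconds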

    From the logger perspective

    The logger that is being used flushes data from time to time, so you won't see the log lines immediately but only after a short delay. You can validate this by comparing the log timestamps with when the lines actually appear.
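
    Separately, if log buffering is suspected, one quick way to rule it out is to force unbuffered output before the app boots. This is a general Ruby technique and an assumption on my side, not something Karafka requires:

        # General Ruby technique (an assumption, not part of the original setup):
        # flush stdout after every write so log lines appear as soon as they are
        # written, making the printed timestamps comparable to when the lines show up.
        $stdout.sync = true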