Kafka retries config and performance implications

I am considering setting the retries mechanism to cover for network blips here and there, for which I think if the retries mechanism covers a few mins, say 2-5, that will be enough for minor network issues. As per the answer to this question and the docs the configs to be set are mainly retries, max.in.flight.requests.per.connection (recommended to be set to 1 by kafka), retry.backoff.ms and delivery.timeout.ms.

My concern is that setting max.in.flight.requests.per.connection to 1 might have performance implications? Anyone has experience with that? What is the default number of connnections a kafka producer makes with a broker cluster? I couldn't find something online about it.

Solution

`max.in.flight.requests.per.connection`

Indeed, this is one of the most important configuration parameters regarding producer's performance, specifically producer's throughput and latency. This parameter controls the maximum number of unacknowledged requests the producer will send to a certain partition on a single connection before blocking.

In other words, it will send a single request, and until acks are received, it won't send another request to the broker (for that partition). As a suggestion, if you don't require all messages to be ordered, do not set this parameter to 1.

Regarding retries and its link to this param:

Allowing retries without setting max.in.flight.requests.per.connection to 1 will potentially change the ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, then the records in the second batch may appear first.

So it's not really recommended to be set to 1 by kafka; It is recommended when you require an ordered delivery. If you don't require this to happen, do not set max.in.flight.requests.per.connection to 1, as your producer's throughput will indeed be decreased.

In resume: set it to one only if you are looking for ordered delivery of event.

In this test, throughput and latency show a decent improvement when increasing max.in.flight.requests from 1 to just 2.

Throughput

Latency

`acks`

There is another param that also is involved here, together with those you already quoted, the number of acks set.

For example, acks = 0 will make both retries and max.in.flight params be completely irrelevant, as the producer will not wait for any ack from any broker, and will assume every request was successfull. Just like an UDP sender.

With acks=0:

1- retries does not take effect as there is no way to know if any failure occurred.

2- max.in.flight does not take effect as there is no possible unacknowledged requests whatsoever.

Setting the acks higher than 0, for example, acks=2, will have a direct impact on the performance as well, because for a request to be identified as successfull, 2 acks will have to be received from the cluster. This means, for example, that the blocking time of a producer which specifies only 1 in flight request will usually increase, as it will have to wait for 2 ack messages before unblocking and being able to send the next request for that partition.

`Idempotence`

There's another concept regarding your question, which is the idempotent producer. This may be the optimal option to achieve a balance between performance and efficiency.

Let's imagine you set some retries in order to guarantee a message arrived properly. The broker receives the message, and when it sents you the ack, a network error makes your producer not to receive it. If retries are set, the producer will send again the same message, creating a duplicate message in the broker.

Kafka 0.11.0 includes support for idempotent and transactional capabilities in the producer. Idempotent delivery ensures that messages are delivered exactly once to a particular topic partition during the lifetime of a single producer

An idempotent producer has a unique producer ID and uses sequence IDs for each message, which allows the broker to ensure it is committing ordered messages with no duplication, on a per partition basis.

This idempotent producer, in newer versions of Kafka clients, come as default with 5 max.in.flight.requests, increasing the performance from the "old" way to ensure delivered order. That's also the max value for the idempodent producer (from 1 to 5 is the valid range of in flight requests),It is, in resume, the best option if you require an ordered, safe pipeline, while keeping the producer's perfomance high.

The idempotent producer leads to the exactly once semantics concept, explained deeper in the link.

`Design for max.in.flight > 1 with idempotence enabled`

In resume, you should judge what your use case's requirements are. Questions like:

Is ordered delivery required?

Are duplicate messages something acceptable?

Do you value throughput and latency to a point where some messages lost/unordered is acceptable?

May the idempotent producer be the answer to your requirements, in order to achieve a balance between performance and message ordering/successfull request guarantees?

This presentation resumes more or less the impact of these configurations on the producer side, it's worth having a look.