Tags: apache-kafka, kafka-producer-api

Data Loss in kafka producer when a Kafka Broker goes down and comes back


I am facing some data loss whenever a Kafka broker goes down and rejoins the cluster. I guess a rebalance (leader re-election) is triggered whenever the broker rejoins, and at that point I observe some errors in my Kafka producer.

The producer writes to a Kafka topic with 40 partitions. Below is the sequence of log lines I see whenever the rebalance is triggered:

[WARN ] 2019-06-05 20:39:08 WARN  Sender:521 - [Producer clientId=producer-1] Got error produce response with correlation id 133054 on topic-partition test_ve-17, retrying (2 attempts left). Error: NOT_LEADER_FOR_PARTITION
...
...
[WARN ] 2019-06-05 20:39:31 WARN  Sender:521 - [Producer clientId=producer-1] Got error produce response with correlation id 133082 on topic-partition test_ve-12, retrying (1 attempts left). Error: NOT_ENOUGH_REPLICAS
...
...
[ERROR] 2019-06-05 20:39:43 ERROR xyz:297 - org.apache.kafka.common.errors.NotEnoughReplicasException: Messages are rejected since there are fewer in-sync replicas than required.
...
...
[WARN ] 2019-06-05 20:39:48 WARN  Sender:521 - [Producer clientId=producer-1] Got error produce response with correlation id 133094 on topic-partition test_ve-22, retrying (1 attempts left). Error: NOT_ENOUGH_REPLICAS
[ERROR] 2019-06-05 20:39:53 ERROR Sender:604 - [Producer clientId=producer-1] The broker returned org.apache.kafka.common.errors.OutOfOrderSequenceException: The broker received an out of order sequence number for topic-partition test_ve-37 at offset -1. This indicates data loss on the broker, and should be investigated.
[INFO ] 2019-06-05 20:39:53 INFO  TransactionManager:372 - [Producer clientId=producer-1] ProducerId set to -1 with epoch -1
[ERROR] 2019-06-05 20:39:53 ERROR xyz:297 - org.apache.kafka.common.errors.OutOfOrderSequenceException: The broker received an out of order sequence number
...
...
[ERROR] 2019-06-05 20:39:53 ERROR xyz:297 - org.apache.kafka.common.errors.OutOfOrderSequenceException: Attempted to retry sending a batch but the producer id changed from 417002 to 418001 in the mean time. This batch will be dropped.

Some of the Kafka configuration we have:

acks = all
min.insync.replicas=2
unclean.leader.election.enable=false
linger.ms=250
retries = 3

I am calling flush() after producing every 3000 records. Is there anything I am doing wrong? Any pointers, please?


Solution

  • Let me assume a few things: you have 3 Kafka broker nodes, the replication factor for all topics is also 3, and you don't create topics on the fly.

    As you have given:

    acks = all
    min.insync.replicas=2
    unclean.leader.election.enable=false
    

    In that scenario, if both of the in-sync replicas go down, you will definitely drop data: the last remaining replica is not eligible to be elected leader because unclean.leader.election.enable=false, so there is no leader to receive the send request. If one of the in-sync replicas comes back alive within a short time and is re-elected as the partition leader, you will avoid data loss, since linger.ms=250 delays sends briefly and the retries cover the gap. The caveat is that linger.ms works together with batch.size: if you set a very low value for batch.size and the number of buffered messages reaches it, the producer may send without waiting for the linger.ms interval to elapse.

    So one definite change I recommend is to increase retries. Also check your configuration for the request.timeout.ms parameter, and find the average time a broker takes to come back after a shutdown. Your retries should cover the time the broker needs to come back alive in case of a casualty. This will definitely help you avoid data loss, provided the other trade-offs are in place to decrease the chance of data loss.
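Producer settings along those lines could be sketched as below. This is a minimal illustration, not a tuned recommendation: it assumes kafka-clients 2.1+, where delivery.timeout.ms bounds the total retry time per record, and the specific timeout values are placeholders you would size to your brokers' observed restart time. Plain string keys are used so the snippet has no dependency beyond the JDK.

```java
import java.util.Properties;

// Illustrative producer settings for riding out a broker restart.
// Values are starting points, not tuned numbers.
public class ResilientProducerConfig {
    static Properties build() {
        Properties props = new Properties();
        props.put("acks", "all");                 // wait for all in-sync replicas
        props.put("enable.idempotence", "true");  // guards against duplicates and
                                                  // reordering when retries fire
        props.put("retries", Integer.toString(Integer.MAX_VALUE));
        // With idempotence, retry freely and let delivery.timeout.ms bound the
        // total time per record; 5 minutes here is a placeholder sized to cover
        // a typical broker restart (an assumption, measure yours).
        props.put("delivery.timeout.ms", "300000");
        props.put("request.timeout.ms", "30000"); // per-request timeout
        props.put("linger.ms", "250");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```

Setting retries very high while bounding delivery.timeout.ms is generally safer than hand-picking a small retry count, because a single number like retries=3 can be exhausted in seconds, well before a restarting broker returns.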
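To make the linger.ms / batch.size interaction above concrete, here is a back-of-the-envelope sketch: a batch is sent when it fills to batch.size or when linger.ms expires, whichever comes first. The message rate and record size below are hypothetical, not taken from the question.

```java
// Rough estimate of when a producer batch is sent: the earlier of the
// batch filling up (batch.size) or the linger timer (linger.ms) expiring.
public class BatchSendEstimate {
    // Estimated wait in ms before a batch is sent.
    static long estimatedSendDelayMs(int batchSizeBytes, int avgRecordBytes,
                                     long recordsPerSecond, long lingerMs) {
        long recordsToFill = Math.max(1, batchSizeBytes / avgRecordBytes);
        long msToFill = recordsToFill * 1000 / recordsPerSecond;
        return Math.min(msToFill, lingerMs); // batch ships on the earlier event
    }

    public static void main(String[] args) {
        // Default batch.size is 16384 bytes; with hypothetical 1 KB records
        // at 1000 records/s the batch fills in ~16 ms, long before the
        // linger.ms=250 timer fires.
        System.out.println(estimatedSendDelayMs(16384, 1024, 1000, 250)); // 16
        // At a low rate (10 records/s), linger.ms=250 is what fires.
        System.out.println(estimatedSendDelayMs(16384, 1024, 10, 250));   // 250
    }
}
```

The point: at high throughput with a small batch.size, linger.ms is effectively ignored, so you cannot rely on that 250 ms window alone to ride out a leader change.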