Kafka producer dealing with lost connection to broker

With a producer configuration like below, I am creating a Singleton producer that is used throughout the application:

properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.consul1:9092,kafka.consul2:9092");
properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
properties.setProperty(ProducerConfig.ACKS_CONFIG, "1");

I am connected to a k8s hosted Kafka cluster. The broker's advertised.listeners is configured to return me the IP addresses and not host names. While normally everything works as expected, the problem occurs when Kafka is restarted, sometimes the IP address changes. Since the producer only knows about the older IP it keeps trying to connect to that host to send messages and none of the messages go through.

I observe that a org.apache.kafka.common.errors.TimeoutException exception is thrown when the send fails. Currently the messages are sent async:

producer.send(data,
                (RecordMetadata recordMetadata, Exception e) -> {
                    if (e != null) {
                        LOGGER.error("Exception while sending message to kafka", e);
                    }
                });

How should the Timeoutexception be handled now? Given that the producer is shared across the application, closing and recreating in the callback does not sound right.

Solution

According to the JavaDocs on the Callback interface the TimeoutException is a retriable Exception that could be handled by increasing the number of retries of the Producer.

In the Kafka documentation you find details on the retries configuration:

retries (Default 0): Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error. Allowing retries without setting max.in.flight.requests.per.connection to 1 will potentially change the ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, then the records in the second batch may appear first.