apache-kafka kafka-producer-api spring-kafka

request.timeout.ms and Spring Kafka Synchronous Event Publishing with KafkaTemplate

I'm a bit confused about the best practice for configuring the timeout of an event published synchronously through Spring Kafka. The Spring Kafka documentation provides an example that uses ListenableFuture's get(SOME_TIME, TimeUnit) to publish events synchronously with a timeout of SOME_TIME (duplicated below for reference).

public void sendToKafka(final MyOutputData data) {
    final ProducerRecord<String, String> record = createRecord(data);

    try {
        template.send(record).get(10, TimeUnit.SECONDS);
        handleSuccess(data);
    }
    catch (ExecutionException e) {
        handleFailure(data, record, e.getCause());
    }
    catch (TimeoutException | InterruptedException e) {
        handleFailure(data, record, e);
    }
}

On the other hand, I was looking at Kafka's Producer Configuration documentation and saw that Kafka has a request.timeout.ms setting, which is described as follows:

The configuration controls the maximum amount of time the client will wait for the response of a request. If the response is not received before the timeout elapses the client will resend the request if necessary or fail the request if retries are exhausted.

Would it make more sense to configure template.send(...).get(...) with some time unit (e.g., 10 seconds/10,000 ms, as given in the example from Spring Kafka above), or would the better approach be to configure request.timeout.ms (along with retries) to emulate this behavior through Kafka internally and make a no-args call to get()?

Solution

It's never a good idea to use the no-args get(); if a bug in the client code means the future is never completed, the calling thread hangs forever.
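
To make the hang concrete, here is a minimal sketch using only the JDK, with no Kafka dependency. A CompletableFuture that is never completed stands in for a send whose result never arrives; the class name BoundedGetDemo, the Outcome enum, and the await helper are hypothetical names chosen for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedGetDemo {

    /** Possible results of waiting on a send future (hypothetical, for illustration). */
    enum Outcome { SUCCESS, FAILED, TIMED_OUT, INTERRUPTED }

    /** Waits at most timeoutMs for the future, so the caller can never block forever. */
    static Outcome await(Future<?> future, long timeoutMs) {
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return Outcome.SUCCESS;
        } catch (ExecutionException e) {
            return Outcome.FAILED;      // the send finished, but with an error
        } catch (TimeoutException e) {
            return Outcome.TIMED_OUT;   // the real result is still unknown
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return Outcome.INTERRUPTED;
        }
    }

    public static void main(String[] args) {
        // Stand-in for a send whose future is never completed (e.g. a client bug).
        CompletableFuture<String> stuck = new CompletableFuture<>();
        System.out.println(await(stuck, 200)); // prints TIMED_OUT
        // A no-args stuck.get() here would block this thread indefinitely.
    }
}
```

The timed get() turns an indefinite hang into an explicit TIMED_OUT outcome that the caller can log and act on.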

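The two timeouts do different things.

request.timeout.ms controls how long the producer waits for the response to a single request before it retries or fails that request. The timeout on the future's get() controls how long your code waits for the final result of the send (success or failure).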

If your producer configuration allows the send to succeed after get() has timed out, you can get duplicates (assuming you retry at the application level after a failure).
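
The duplicate scenario can be simulated without a broker. In this JDK-only sketch (Java 9+ for delayedExecutor), a delayed CompletableFuture stands in for a slow send and a list stands in for the topic; DuplicateOnTimeoutDemo, run, and the record value are all hypothetical names for illustration, not Spring Kafka API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class DuplicateOnTimeoutDemo {

    /**
     * Sends `value` with application-level retries and returns the contents of a
     * stand-in "topic" once every in-flight send has finished.
     */
    static List<String> run(String value, long sendDelayMs, long getTimeoutMs, int attempts) {
        List<String> topic = new CopyOnWriteArrayList<>();
        List<CompletableFuture<Void>> inFlight = new ArrayList<>();

        for (int attempt = 1; attempt <= attempts; attempt++) {
            // Simulated slow send: the record lands only after sendDelayMs.
            CompletableFuture<Void> send = CompletableFuture.runAsync(
                    () -> topic.add(value),
                    CompletableFuture.delayedExecutor(sendDelayMs, TimeUnit.MILLISECONDS));
            inFlight.add(send);
            try {
                send.get(getTimeoutMs, TimeUnit.MILLISECONDS);
                break; // result confirmed, no retry needed
            } catch (TimeoutException e) {
                // We stop waiting, but the send is still in flight and may succeed.
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }

        // Wait for every send to settle so the final state is visible.
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
        return topic;
    }

    public static void main(String[] args) {
        // get() gives up after 50 ms but each send needs 500 ms: both attempts land.
        System.out.println(run("order-42", 500, 50, 2));  // prints [order-42, order-42]
        // get() waits longer than the send needs: exactly one record.
        System.out.println(run("order-42", 50, 2000, 2)); // prints [order-42]
    }
}
```

In the first call the application treats the timeout as a failure and retries, yet the original send still completes, so the record is written twice.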

I suppose the "best practice" would be to use a get() timeout that is greater than retries * request.timeout.ms, although that could be a long time. It does ensure that you get the real result of the send. A timeout in that situation should be considered an anomaly that needs investigation.
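
As a rough sketch of that rule, the get() timeout can be derived from the same values you pass to the producer. GetTimeoutCalculator and getTimeout are hypothetical names; the config keys "retries" and "request.timeout.ms" are the standard producer property names. Counting the initial attempt (retries + 1) and adding a safety margin are assumptions of this sketch that go slightly beyond the rule as stated, and retry backoff is ignored.

```java
import java.time.Duration;
import java.util.Map;

class GetTimeoutCalculator {

    /**
     * Returns a get() timeout that exceeds the producer's worst-case retry window.
     * Assumption: (retries + 1) attempts, each bounded by request.timeout.ms,
     * plus a caller-supplied margin.
     */
    static Duration getTimeout(Map<String, Object> producerConfig, Duration margin) {
        long retries = Long.parseLong(String.valueOf(producerConfig.get("retries")));
        long requestTimeoutMs =
                Long.parseLong(String.valueOf(producerConfig.get("request.timeout.ms")));
        // multiplyExact fails loudly instead of overflowing on very large retry counts.
        long worstCaseMs = Math.multiplyExact(retries + 1, requestTimeoutMs);
        return Duration.ofMillis(worstCaseMs).plus(margin);
    }

    public static void main(String[] args) {
        Map<String, Object> config = Map.of("retries", 3, "request.timeout.ms", 30_000);
        Duration timeout = getTimeout(config, Duration.ofSeconds(5));
        System.out.println(timeout.toMillis()); // prints 125000
    }
}
```

The result could then be passed as template.send(record).get(timeout.toMillis(), TimeUnit.MILLISECONDS). Newer Kafka clients also have delivery.timeout.ms, an upper bound on the total time to report success or failure of a send including retries; if your client version has it, that value may be the more practical one to exceed.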