apache-kafka, kafka-consumer-api, producer-consumer

Optimal way of disaster tolerance during Kafka message consumption


I currently have a service that consumes messages from a Kafka topic and does some computation with them. The service is designed to process messages in batches (e.g. 1000 messages per batch) and commit the offset only after a batch is done, since latency is not a problem. However, I realized that if the service processes 500 messages, crashes, and then restarts, it may re-compute those 500 messages, because it never committed an offset for them and Kafka therefore does not know how far the consumer actually got. How should I design the process so that I can guarantee exactly-once computation without committing the offset after every single message? Once again, latency is not a problem, but I don't want to sacrifice so much by committing an offset on every message.
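A minimal sketch of the batch-and-commit loop described above, assuming the plain Java KafkaConsumer with auto-commit disabled; the topic name, group id, and process() step are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BatchConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-batch-service");            // placeholder group id
        props.put("enable.auto.commit", "false");              // commit manually, once per batch
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));            // placeholder topic
            int processedSinceCommit = 0;
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);                             // the computation step
                    processedSinceCommit++;
                }
                if (processedSinceCommit >= 1000) {
                    consumer.commitSync();                       // offsets advance only per batch
                    processedSinceCommit = 0;
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) { /* computation goes here */ }
}
```

A crash between two commitSync() calls is exactly the failure mode described: everything processed since the last commit will be polled again on restart.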


Solution

  • Kafka supports transactional processing, so I would start with that (a rough sketch of the read-process-write pattern follows after this answer).

    But if you only commit offsets every 1000 records and the service crashes after processing, say, batch[0..499], then you need some downstream logic, outside the scope of Kafka, to prevent you from handling those records again. For example, use Redis to store each record ID and do a fast hash lookup to see whether a record has already been processed (a small Redis sketch is included below). Sure, that store is another point of failure, but this is the tradeoff for writing consumer code that isn't idempotent.

    A restarted Kafka consumer will automatically rewind to the last committed offset, and start reading again, as if nothing happened.

    Example of an idempotent record: (id, null) is a delete event, and processing it a second time should do nothing because that ID would already be gone from your downstream systems. But if you rewind and see (id, data) again, the consumer would upsert that record again until it reaches the (id, null) tombstone once more (a small sketch of this upsert/delete handling is at the end of this answer).
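Sketch of the transactional (read-process-write) approach mentioned above. Note that Kafka transactions only give exactly-once semantics when the results are written back to Kafka; if the computation writes to an external system, you still need idempotency there. Topic names and the transactional.id are placeholders:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class TransactionalPipeline {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", "localhost:9092");
        cProps.put("group.id", "my-batch-service");
        cProps.put("enable.auto.commit", "false");
        cProps.put("isolation.level", "read_committed");
        cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "localhost:9092");
        pProps.put("transactional.id", "my-batch-service-tx");   // placeholder
        pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("input-topic"));
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    String result = compute(record.value());      // the computation step
                    producer.send(new ProducerRecord<>("output-topic", record.key(), result));
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                }
                // Offsets and output records commit atomically: both happen or neither does.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            }
        }
    }

    static String compute(String value) { return value; }  // stand-in for the real work
}
```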
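A rough sketch of the Redis de-duplication check, assuming the Jedis client; the key naming and the way you derive a record ID are assumptions to adapt to your schema:

```java
import redis.clients.jedis.Jedis;

public class ProcessedTracker {
    private final Jedis jedis = new Jedis("localhost", 6379);

    /** Returns true if this call claimed the record, false if it was already processed. */
    public boolean markIfNew(String recordId) {
        // SETNX-style claim: only the first writer succeeds, so a replayed record is skipped.
        long added = jedis.setnx("processed:" + recordId, "1");
        return added == 1;
    }
}
```

In the consumer loop you would then call something like `if (tracker.markIfNew(recordId)) { process(record); }` so that replayed records after a crash are looked up and skipped instead of recomputed.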
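And a small sketch of the upsert/delete handling from the tombstone example, here against an in-memory map as a stand-in for whatever downstream store you use. Applying the events in order is safe to replay because re-applying the same (id, value) or (id, null) leaves the store in the same state:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MaterializedView {
    private final Map<String, String> store = new ConcurrentHashMap<>();

    public void apply(String id, String value) {
        if (value == null) {
            store.remove(id);      // (id, null) is a delete; deleting twice is a no-op
        } else {
            store.put(id, value);  // upsert; re-applying the same value changes nothing
        }
    }
}
```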