Tags: apache-kafka, apache-kafka-streams

Will Kafka Streams guarantee at-least-once processing in stateful processors even when exactly-once is disabled?


This question comes to mind because we are running Kafka Streams applications without EOS enabled due to infra constraints. We are unsure of the behavior when doing custom logic via the transformer/processor API with changelogged state stores.

Say we are using the following topology to de-duplicate records before sending them downstream:

[topic] -> [flatTransformValues + state store] -> [...(downstream)]

The transformer here compares each incoming record against the state store and only forwards + updates the record when the value has changed, so for messages [A:1], [A:1], [A:2], we expect downstream to receive only [A:1], [A:2]. A sketch of what we do is shown below.
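For illustration, here is a minimal sketch of such a de-duplication step, assuming a String-keyed/String-valued topic and a changelog-backed persistent store (the store name, topic name, and class name are placeholders, not our real code):

```java
import java.util.Collections;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class DedupTopology {

    static final String STORE_NAME = "dedup-store"; // illustrative name

    public static KStream<String, String> build(StreamsBuilder builder) {
        // Persistent store; Kafka Streams backs it with a changelog topic by default
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore(STORE_NAME),
                Serdes.String(), Serdes.String()));

        KStream<String, String> source = builder.stream("input-topic");

        // Forward a record only when its value differs from the last one seen for that key
        return source.flatTransformValues(() -> new ValueTransformerWithKey<String, String, Iterable<String>>() {
            private KeyValueStore<String, String> store;

            @SuppressWarnings("unchecked")
            @Override
            public void init(ProcessorContext context) {
                store = (KeyValueStore<String, String>) context.getStateStore(STORE_NAME);
            }

            @Override
            public Iterable<String> transform(String key, String value) {
                String previous = store.get(key);
                if (value != null && value.equals(previous)) {
                    return Collections.emptyList(); // unchanged value: drop as duplicate
                }
                store.put(key, value);              // record the new value, then forward it
                return Collections.singletonList(value);
            }

            @Override
            public void close() { }
        }, STORE_NAME);
    }
}
```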

The question is: when a failure happens, is it possible that [A:2] gets stored in the state store's changelog while downstream never receives the message, so that any retry reading [A:2] will discard the record and it is lost forever?

If not, please tell me what mechanism prevents this from happening. One way I think it could work is if Kafka Streams produced to the changelog topics and committed offsets only after the produce to downstream succeeds.

Much appreciated!


Solution

  • The question is: when a failure happens, is it possible that [A:2] gets stored in the state store's changelog while downstream never receives the message, so that any retry reading [A:2] will discard the record and it is lost forever?

    Yes, that's possible. At-least-once only guarantees that the event will be re-read and re-processed. In your case, the already-updated state would change the outcome of the second round of processing: the retried record would be detected as a duplicate and dropped.

    In the end, it does not make sense to write a de-duplication program without using exactly-once guarantees. Even if you could prevent the scenario you describe, at-least-once processing could introduce duplicates by itself...
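    If the infra constraints are ever lifted, switching to exactly-once is a single config change. A sketch, assuming Kafka Streams 3.0+ and brokers on 2.5+ (the application id and bootstrap servers are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedup-app");        // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // placeholder
// exactly_once_v2 is the recommended guarantee since Kafka Streams 3.0
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
```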