While trying to implement exactly-once semantics, I found this in the official Kafka documentation:
Exactly-once delivery requires co-operation with the destination storage system but Kafka provides the offset which makes implementing this straight-forward.
Does this mean that I can use the (topic, partiton, offset) tuple as a unique primary identifier to implement deduplication? An example implementation would be to use an RDBMS and this tuple as a primary key for an insert operation within a big processing transaction where the transaction fails if the insertion is not possible anymore because of an already existing primary key.
I think the question is equivalent to:
If the offset is reused when retrying, consumers obviously see multiple messages with the same offset. Other question, maybe somehow related:
Another possibility could be that the offset is determined e.g. solely by or as recently as the message reaches the leader which does the job (implying that - if not listening to something like a producer's suggested offset - there are probably no gaps/offset jumps, but also different offsets for duplicate messages and I would have to use my own unique identifier within the application's message on application level).
To answer my own question:
The offset is generated solely by the server (more precisely: by the leader of the corresponding partition), not by the producing client. It is then sent back to the producer in the produce response. So:
- Does a producer use the same offset for a message when retrying to send it after detecting a possible failure or does every retry attempt get its own offset?
No. (See update below!) The producer does not determine offsets and two identical/duplicate application messages can have different offsets. So the offset cannot be used to identify messages for producer deduplication purposes and a custom UID has to be defined in the application message. (Source)
- With single or multiple producers producing to the same topic, can there be "gaps" in the offset number sequence seen by one consumer?
Due to the fact that there is only a single leader for every partition which maintains the current offset and the fact that (with the default configuration) this leadership is only transfered to active in-sync replica in case of a failure, I assume that the latest used offset is always communicated correctly when electing a new leader for a partition and therefore there are should not be any offset gaps or jumps initially. However, because of the log compaction feature, there are cases (assuming log compaction being enabled) where there can indeed be gaps in a stream of offsets when consuming already committed messages of a partition once again after the compaction has kicked in. (Source)
Starting from Kafka version 0.11.0, producers now additionally send a sequence number with their requests, which is then used by the leader to deduplicate requests by this number and the producer's ID. So with 0.11.0, the precondition on the producer side for implementing exactly once semantics is given by Kafka itself and there's no need to send another unique ID or sequence number within the application's message. Therefore, the answer to question 1 could now also be yes, somehow. However, note that exactly once semantics are still only possible with the consumer never failing. Once the consumer can fail, one still has to watch out for duplicate message processings on consumer side.