Kafka Streams delivery semantics for a simple forwarder


I have a stateless Kafka Streams application that consumes from a topic and publishes to a different queue (Cloud Pub/Sub) inside a forEach. The topology does not end by producing into a new Kafka topic.
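
Roughly, the forwarder looks like the sketch below (the topic name, streamsProperties and the publishToPubSub helper are simplified placeholders, not my exact code):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;

    StreamsBuilder builder = new StreamsBuilder();

    // Stateless pass-through: consume raw bytes and hand every record to Cloud Pub/Sub.
    builder.stream("input-topic", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
           .foreach((key, value) -> publishToPubSub(value)); // placeholder for the Pub/Sub publish call

    KafkaStreams streams = new KafkaStreams(builder.build(), streamsProperties);
    streams.start();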

How do I know which delivery semantics I can guarantee? Given that it is just a message forwarder, with no deserialisation or any other transformation applied: are there any cases in which I could end up with duplicate or missed messages?

I'm thinking about the following scenarios and their impact on how offsets are committed:

  • Sudden application crash
  • Error occurring on publish

Thanks guys


Solution

  • If you consider the Kafka-to-Kafka loop that a Kafka Streams application usually creates, setting the property:

    processing.guarantee=exactly_once
    

    is enough to get exactly-once semantics, even in failure scenarios.

    Under the hood, Kafka uses a transaction to guarantee that the consume-process-produce-commit-offsets cycle is executed with an all-or-nothing guarantee.
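
    With the plain Kafka clients, the same consume-process-produce-commit pattern can be sketched roughly as below (topic names, ids and the minimal error handling are illustrative assumptions; Kafka Streams performs the equivalent internally when exactly_once is enabled):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.*;
    import org.apache.kafka.clients.producer.*;
    import org.apache.kafka.common.TopicPartition;

    public class TransactionalForwarder {
        public static void main(String[] args) {
            Properties cprops = new Properties();
            cprops.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder brokers
            cprops.put(ConsumerConfig.GROUP_ID_CONFIG, "eos-forwarder");             // placeholder group id
            cprops.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            cprops.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");     // skip aborted transactions
            cprops.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                       "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            cprops.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                       "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            Properties pprops = new Properties();
            pprops.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            pprops.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "eos-forwarder-1");   // enables idempotence, fences zombies
            pprops.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                       "org.apache.kafka.common.serialization.ByteArraySerializer");
            pprops.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                       "org.apache.kafka.common.serialization.ByteArraySerializer");

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cprops);
                 KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pprops)) {
                consumer.subscribe(Collections.singletonList("input-topic"));
                producer.initTransactions();
                while (true) {
                    ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                    if (records.isEmpty()) continue;
                    producer.beginTransaction();
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<byte[], byte[]> r : records) {
                        producer.send(new ProducerRecord<>("output-topic", r.key(), r.value()));
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                    new OffsetAndMetadata(r.offset() + 1));
                    }
                    // The consumed offsets are committed inside the same transaction as the
                    // produced records, so either both become visible or neither does.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                    // On a KafkaException you would abortTransaction() and rewind the consumer instead.
                }
            }
        }
    }

    The key limitation is that sendOffsetsToTransaction() only ties the offsets to records produced to Kafka topics, which is why the same trick is not directly available for a Pub/Sub sink.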

    Writing a sink connector from Kafka to Google Pub/Sub with exactly-once semantics would mean solving the same issues Kafka already solves for the Kafka-to-Kafka scenario:

    1. The producer.send() could result in duplicate writes of message B due to internal retries. This is addressed by the idempotent producer and is not the focus here.
    2. We may reprocess the input message A, resulting in duplicate B messages being written to the output, violating the exactly once processing semantics. Reprocessing may happen if the stream processing application crashes after writing B but before marking A as consumed. Thus when it resumes, it will consume A again and write B again, causing a duplicate.
    3. Finally, in distributed environments, applications will crash or, worse, temporarily lose connectivity to the rest of the system. Typically, new instances are automatically started to replace the ones which were deemed lost. Through this process, we may have multiple instances processing the same input topics and writing to the same output topics, causing duplicate outputs and violating the exactly once processing semantics. We call this the problem of “zombie instances.”

    Assuming your publishing logic to Cloud Pub/Sub does not suffer from problem 1 (just like Kafka producers when enable.idempotence=true is set), you are still left with problems 2 and 3.

    Without solving these issues, your processing semantics will be the delivery semantics your consumer provides: at-least-once, if you commit the offsets manually only after the publish to Pub/Sub has succeeded. A sketch of that pattern follows below.
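
    As an illustration of that at-least-once pattern, a plain consumer that publishes to Pub/Sub and commits offsets only afterwards might look like the sketch below (the project, topic and group names are made up, and the Pub/Sub wiring is an assumption rather than a tested integration):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import com.google.cloud.pubsub.v1.Publisher;
    import com.google.protobuf.ByteString;
    import com.google.pubsub.v1.PubsubMessage;
    import com.google.pubsub.v1.TopicName;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AtLeastOnceForwarder {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder brokers
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "pubsub-forwarder");          // placeholder group id
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");           // commit only after publish
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            Publisher publisher = Publisher.newBuilder(TopicName.of("my-project", "my-topic")).build();
            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("input-topic"));       // placeholder topic
                while (true) {
                    ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<byte[], byte[]> record : records) {
                        if (record.value() == null) continue;                       // skip tombstones in this sketch
                        PubsubMessage msg = PubsubMessage.newBuilder()
                                .setData(ByteString.copyFrom(record.value()))
                                .build();
                        // Block until Pub/Sub acknowledges the message; an exception here
                        // leaves the offset uncommitted, so the record will be reprocessed.
                        publisher.publish(msg).get();
                    }
                    // Crashing after the publish but before this commit means the batch is
                    // re-consumed and re-published on restart.
                    consumer.commitSync();
                }
            }
        }
    }

    If the process dies between the blocking publish and commitSync(), the records are consumed and published again after the restart, which is exactly where the possible duplicates (but not losses) come from.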