queue, message-queue, apache-pulsar

Apache Pulsar message delivery semantics


I went through the Apache Pulsar documentation on message delivery semantics. The delivery semantics it mentions (at-least-once, at-most-once, and effectively-once) are described for Pulsar Functions. If we don't use Pulsar Functions, which delivery semantics are available?


Solution

  • TL;DR: Today, neither Pulsar Functions, Pulsar+Spark (you will see duplicates), nor Pulsar+Flink (you will see duplicates) support effectively-once semantics, aka exactly-once semantics. Only in certain edge cases can you manually implement such semantics with a DIY setup. What Pulsar does support today is (1) at-most-once semantics = you may lose data and (2) at-least-once semantics = you will not lose data but may see duplicates.
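
    To make the difference between (1) and (2) concrete, here is a minimal sketch in plain Python. It does not use the Pulsar client; `SimulatedSubscription` and `consume` are hypothetical names for a toy model in which the broker redelivers any message that has not been acknowledged. The only thing that changes between the two semantics is whether the consumer acknowledges before or after processing.

```python
class SimulatedSubscription:
    """Hypothetical stand-in for a subscription: unacked messages get redelivered."""

    def __init__(self, messages):
        self.pending = list(messages)

    def receive(self):
        # Always hand out the oldest message that has not been acknowledged yet.
        return self.pending[0] if self.pending else None

    def acknowledge(self, msg):
        self.pending.remove(msg)


def consume(sub, ack_before_processing, crash_on):
    """Drain the subscription, simulating one consumer crash on `crash_on`."""
    processed = []
    crashed = False
    while (msg := sub.receive()) is not None:
        crash_now = msg == crash_on and not crashed
        if ack_before_processing:
            # At-most-once: acknowledge first, then process.
            sub.acknowledge(msg)
            if crash_now:
                crashed = True
                continue  # crashed after the ack, before processing: message is lost
            processed.append(msg)
        else:
            # At-least-once: process first, then acknowledge.
            processed.append(msg)
            if crash_now:
                crashed = True
                continue  # crashed before the ack: message will be redelivered
            sub.acknowledge(msg)
    return processed


at_most_once = consume(SimulatedSubscription(["m1", "m2", "m3"]), True, "m2")
at_least_once = consume(SimulatedSubscription(["m1", "m2", "m3"]), False, "m2")
print(at_most_once)   # ['m1', 'm3']
print(at_least_once)  # ['m1', 'm2', 'm2', 'm3']
```

    In the first run `m2` is lost; in the second it is processed twice. Neither run yields each message exactly once, which is the gap the rest of this answer is about.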

    Regarding (3) effectively-once support: I can certainly imagine that you have been confused. Despite claims in the Pulsar documentation to support effectively-once semantics, and several (unfortunately misleading) blog articles on the subject (example), Pulsar in fact does not support this. What Pulsar does support is an idempotent producer and deduplication of messages. This functionality is indeed required but, and this is the important aspect, not sufficient for exactly-once semantics. The current functionality only works when producing a single message to a single partition. For example, you cannot atomically produce multiple messages to one partition with Pulsar today, let alone to multiple partitions. It also means that interaction with state (e.g., aggregating data by counting, or performing joins between data streams) is not exactly-once.
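
    A rough illustration of why an idempotent producer plus deduplication is necessary but not sufficient. This is a toy model in plain Python, not Pulsar's actual implementation or API: `SimulatedPartition` and `send_with_retry` are hypothetical, and the model simply assumes the partition remembers the highest sequence ID seen per producer and drops anything at or below it.

```python
class SimulatedPartition:
    """Hypothetical partition that deduplicates on (producer name, sequence ID)."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer name -> highest sequence ID accepted so far

    def append(self, producer, seq, payload):
        if seq <= self.last_seq.get(producer, -1):
            return False  # already stored: a retry of the same message is dropped
        self.last_seq[producer] = seq
        self.log.append(payload)
        return True


def send_with_retry(partition, producer, seq, payload, attempts=2):
    """Simulate a lost broker reply: the producer resends the same message."""
    for _ in range(attempts):
        partition.append(producer, seq, payload)


orders = SimulatedPartition()
payments = SimulatedPartition()

# Covered by deduplication: one message, one partition, resent after a timeout.
send_with_retry(orders, "p1", 0, "order-1")
print(orders.log)  # ['order-1']

# Not covered: two writes that belong together. The producer "crashes" after
# the first write, so the matching write to `payments` never happens.
orders.append("p1", 1, "order-2")
print(orders.log, payments.log)  # ['order-1', 'order-2'] []
```

    Deduplication keeps the retry from creating a duplicate, but nothing ties the two partitions together: readers see `order-2` without its payment. Closing that gap is what atomic multi-message, multi-partition writes (transactions) are for.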

    What's missing, and when will Pulsar support exactly-once semantics? To guarantee exactly-once semantics, Pulsar must first add support for transactions. This is indeed a planned feature, originally targeted for Pulsar 2.6.0 (released in June 2020), but as of today there is still a lot of work left to be done. I'm afraid I am not aware of an updated ETA.

    Where to learn more: A good Pulsar-specific source to understand this in more detail is the Dec 2019 presentation Apache Pulsar: Transactions Preview by Pulsar committers that summarizes the current lack of exactly-once support and explains why support for transactions in Pulsar is required to achieve it.

    Another good source to understand this tricky subject in general is this 3-part article series on how exactly-once semantics are provided by Apache Kafka (blog series part1, part2, part3), which is a technology similar to Apache Pulsar. The series explains why idempotent producers are just one piece of the puzzle, why transactions (which build on idempotent producers) are needed, and how this was designed, implemented, and released in Apache Kafka back in 2017. That's why you benefit from exactly-once semantics when processing data in Kafka with e.g. Kafka Streams (included in Kafka) or with Kafka and Apache Flink. If you look at Pulsar's plans and roadmap in 2020 to introduce exactly-once support, you can clearly see the very close parallels to Kafka's approach. As a user, the notable difference is that Kafka released all the functionality in one go rather than piece by piece. That explains why it took the Kafka community several years to design, build, and test the feature, and it has made it much clearer what is actually supported vs. what is not.