publish-subscribe, google-cloud-pubsub

Alternative to Apache Beam's PubsubIO read withIdAttribute in the regular Pub/Sub client library


In the Beam SDK, the PubsubIO read transform provides an option to deduplicate messages by message id: https://beam.apache.org/releases/javadoc/2.23.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.Read.html#withIdAttribute-java.lang.String-

When I checked the Pub/Sub client libraries (for Java and Python), I didn't see a similar option for deduplicating messages by message id.

So my questions are:

  1. Do the Pub/Sub client libraries (Python and Java) have similar functionality? Perhaps I missed it because it has a different name.
  2. If they don't, how do you handle this situation? I'm curious how others solve it, as inspiration, because I'm thinking about using a cache in my client application that stores the most recent message ids for deduplication.

Thank you.


Solution

  • The Pub/Sub client library has no equivalent feature. Cloud Dataflow, which runs Beam pipelines, keeps a cache of the most recent message IDs (I don't know how many or for how long, but it's only a few minutes). It's a Beam feature.

    When you use Pub/Sub, and because Pub/Sub guarantees only at-least-once delivery, it's recommended to make your processing idempotent.

    In general, accommodating more-than-once delivery requires your subscriber to be idempotent when processing messages.
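
The cache idea from the question could be sketched roughly as below. This is a minimal, in-memory illustration in plain Python, not something the Pub/Sub client library provides: the class name `RecentIdCache`, the method `first_time`, and the default TTL and size limit are all hypothetical choices. It only deduplicates within one process, so several subscriber instances would need a shared store instead, and it is a complement to idempotent processing rather than a replacement for it.

```python
import time
from collections import OrderedDict


class RecentIdCache:
    """Hypothetical helper: remembers recently seen IDs for a limited time.

    Entries expire after ttl_seconds, and the oldest entries are dropped
    once more than max_size IDs are stored.
    """

    def __init__(self, ttl_seconds=600.0, max_size=100_000, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._max_size = max_size
        self._clock = clock          # injectable so the cache can be tested
        self._seen = OrderedDict()   # id -> time first seen, oldest first

    def _evict(self, now):
        # Remove the oldest entry while it is expired or the cache is too big.
        while self._seen:
            _, seen_at = next(iter(self._seen.items()))
            if now - seen_at < self._ttl and len(self._seen) <= self._max_size:
                break
            self._seen.popitem(last=False)

    def first_time(self, msg_id):
        """Return True and record msg_id if it was not seen within the TTL."""
        now = self._clock()
        self._evict(now)
        if msg_id in self._seen:
            return False
        self._seen[msg_id] = now
        self._evict(now)
        return True


def handle(msg_id, payload, cache, process):
    """Run process(payload) only for IDs the cache has not seen recently."""
    if cache.first_time(msg_id):
        process(payload)
        return True
    return False  # duplicate: skip processing (the message should still be acked)
```

In a Python subscriber callback, `msg_id` would typically be `message.message_id` or, to also catch duplicates created by publisher retries, a custom ID taken from `message.attributes`; a duplicate should still be acknowledged with `message.ack()` so that it is not redelivered.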