I am using Cloud Storage notifications with Pub/Sub in my streaming pipeline.
I read the documentation about the delivery semantics of Cloud Storage notifications. It says that delivery is at-least-once and that events are not guaranteed to arrive in the same order as the objects were uploaded (as I understand it, this means I can get several events with the same generation. Am I right?).
Notifications are not guaranteed to be published in the order Pub/Sub receives them.
Pub/Sub also offers at-least-once delivery to the recipient, which means that you could receive multiple messages, with multiple IDs, that represent the same Cloud Storage event.
I wrote a stateful DoFn in Apache Beam that keeps the largest generation processed so far as state, so that I can detect duplicated or out-of-order generations. I tested it by uploading objects to Cloud Storage, one every three seconds, but I didn't catch any duplicated events or out-of-order generations.
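For illustration, the check boils down to roughly the following. This is a plain-Python sketch, not the actual Beam DoFn: `GenerationTracker` and `classify` are hypothetical names, and the dictionary only stands in for Beam's per-key state.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GenerationTracker:
    """Remembers the largest generation seen per object name.

    The dict is a stand-in for per-key state in a stateful DoFn (assumption).
    """
    latest: Dict[str, int] = field(default_factory=dict)

    def classify(self, object_name: str, generation: int) -> str:
        last = self.latest.get(object_name)
        if last is None or generation > last:
            # First event for this object, or a newer generation: advance state.
            self.latest[object_name] = generation
            return "new"
        if generation == last:
            # Same generation as the latest one processed: a redelivery.
            return "duplicate"
        # Older than the latest generation: either a late event or a
        # redelivery of an earlier one; this state cannot tell them apart.
        return "out_of_order"


tracker = GenerationTracker()
events = [("a.txt", 100), ("a.txt", 200), ("a.txt", 200), ("a.txt", 150)]
results = [tracker.classify(name, gen) for name, gen in events]
print(results)
```

Note that keeping only the largest generation cannot distinguish a late event from a repeated older one; both show up as out of order.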
My question is: what data volume or velocity is needed to observe duplicated events or out-of-order generations?
Personally I would not try the exercise you are asking for.
The reason is that you may never catch such events during your tests, yet they may still happen in production. And the other way around: you may see them in testing and they may never occur in production.
That is by design: duplicates may be very rare, depending on Pub/Sub's operating state, load, network traffic, etc.
You just need to accommodate that behavior by making your event handler's logic idempotent.
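As a rough sketch of what idempotent handling could look like, assuming each message carries the Cloud Storage notification attributes `bucketId`, `objectId`, `objectGeneration` and `eventType`: the `IdempotentHandler` class is a hypothetical name, and the in-memory set is only for illustration (a real pipeline would need durable, per-key state).

```python
from typing import Callable, Dict, Set, Tuple

# (bucket, object name, generation, event type) identifies one storage event.
EventKey = Tuple[str, str, int, str]


class IdempotentHandler:
    """Applies a side effect at most once per distinct storage event."""

    def __init__(self, side_effect: Callable[[EventKey], None]) -> None:
        self._seen: Set[EventKey] = set()  # illustration only, not durable
        self._side_effect = side_effect

    def handle(self, attributes: Dict[str, str]) -> bool:
        key = (
            attributes["bucketId"],
            attributes["objectId"],
            int(attributes["objectGeneration"]),
            attributes["eventType"],
        )
        if key in self._seen:
            # Redelivered message for an event already handled: skip it.
            return False
        self._side_effect(key)
        self._seen.add(key)
        return True


processed = []
handler = IdempotentHandler(processed.append)
msg = {
    "bucketId": "my-bucket",
    "objectId": "a.txt",
    "objectGeneration": "100",
    "eventType": "OBJECT_FINALIZE",
}
first = handler.handle(msg)   # handled
second = handler.handle(msg)  # same event again, ignored
```

Because the key is built from the event itself rather than the Pub/Sub message ID, two messages with different IDs that describe the same event are still collapsed into one.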
Also, have a look at the Pub/Sub release notes: they have recently introduced an "exactly-once delivery" feature (maybe still in beta).