PubsubIO allows deduplicating messages based on an ID attribute:
PubsubIO.readStrings().fromSubscription(pubSubSubscription).withIdAttribute("message_id")
For how long does Dataflow remember this id? Is it documented anywhere?
It is documented, but the relevant page has not yet been migrated to the V2+ version of the docs. The information can still be found in the V1 docs, which refer to the ID attribute as idLabel:
https://cloud.google.com/dataflow/model/pubsub-io#using-record-ids
"If you've set a record ID label when using PubsubIO.Read, when Dataflow receives multiple messages with the same ID (which will be read from the attribute with the name of the string you passed to idLabel), Dataflow will discard all but one of the messages. However, Dataflow does not perform this de-duplication for messages with the same record ID value that are published to Cloud Pub/Sub more than 10 minutes apart."
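To make the 10-minute rule concrete, here is a small plain-Java sketch of the behavior the quote describes. It is only a conceptual model, not Dataflow's actual implementation: the class name DedupWindowSketch and the accept method are hypothetical, and it assumes the window is measured from the publish time of the last message that was kept for a given ID.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Conceptual model only: NOT how Dataflow implements deduplication internally.
final class DedupWindowSketch {
    private final Duration window;
    // Record ID -> publish time of the last message kept for that ID (assumption).
    private final Map<String, Instant> lastKept = new HashMap<>();

    DedupWindowSketch(Duration window) {
        this.window = window;
    }

    /** Returns true if the message is kept, false if it is discarded as a duplicate. */
    boolean accept(String recordId, Instant publishTime) {
        Instant previous = lastKept.get(recordId);
        if (previous != null
                && Duration.between(previous, publishTime).abs().compareTo(window) <= 0) {
            return false; // same ID published within the window: discard
        }
        // First sighting, or published more than `window` apart: keep and remember it.
        lastKept.put(recordId, publishTime);
        return true;
    }

    public static void main(String[] args) {
        DedupWindowSketch dedup = new DedupWindowSketch(Duration.ofMinutes(10));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");

        System.out.println(dedup.accept("order-1", t0));                               // true
        System.out.println(dedup.accept("order-1", t0.plus(Duration.ofMinutes(5))));   // false
        System.out.println(dedup.accept("order-1", t0.plus(Duration.ofMinutes(11))));  // true
    }
}
```

The third message gets through even though its ID has been seen before, because it was published more than 10 minutes after the one that was kept. If your publisher can retry with the same ID after a longer delay than that, you would need your own deduplication step downstream.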