I use Datastream to transfer data from PostgreSQL to Cloud Storage. The documentation says that backfill and CDC can overlap, resulting in duplicate events, and that event metadata should be used to remove the duplicates. The article about events suggests using the uuid field to find duplicates.
I tried to find events with the same uuid, but it turned out that events from backfill have the same uuid. How can I find and remove duplicated events, if there are any?
To remove duplicates between backfill events and CDC events in Cloud Storage, use the table's primary key columns instead of the Datastream uuid.
Backfill events contain only INSERT operations. To find duplicates, group all backfill and CDC events by primary key and look for more than one INSERT operation per primary key value. If you find such a pair and there is no DELETE operation between their timestamps, the two INSERTs are duplicates, and either of them can be dropped.
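A minimal Python sketch of that check, assuming the events have already been read from Cloud Storage into flat dicts. The field names `pk` (a tuple of primary key values), `op` (the operation type) and `ts` (a sortable source timestamp) are hypothetical simplifications, not the actual Datastream event schema, so map them from your event metadata and payload first. The function name `drop_duplicate_inserts` is also just for illustration.

```python
from collections import defaultdict


def drop_duplicate_inserts(events):
    """Drop repeated INSERTs for the same primary key unless a DELETE separates them.

    Each event is a dict with hypothetical keys:
      "pk" - tuple of primary key values
      "op" - "INSERT", "UPDATE" or "DELETE"
      "ts" - sortable timestamp of the operation
    """
    # Group backfill and CDC events together by primary key.
    by_pk = defaultdict(list)
    for event in events:
        by_pk[event["pk"]].append(event)

    kept = []
    for pk_events in by_pk.values():
        insert_seen = False  # True while an INSERT has not been followed by a DELETE
        for event in sorted(pk_events, key=lambda e: e["ts"]):
            if event["op"] == "DELETE":
                insert_seen = False
            elif event["op"] == "INSERT":
                if insert_seen:
                    continue  # second INSERT with no DELETE in between: duplicate
                insert_seen = True
            kept.append(event)
    return kept
```

This sketch keeps the earliest INSERT of each duplicate pair, which is an arbitrary choice since either copy can be dropped. The result is ordered by primary key group and then by timestamp, not in the original input order.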