I'm writing a Python script that generates N distinct messages and publishes them to a topic on Google Cloud Pub/Sub using `BatchSettings(max_messages=1_000, max_bytes=1_000_000, max_latency=1)`.
Each message carries a unique identifier, either a `uuid.uuid4` or an integer from an increasing sequence.
The topic has a BigQuery Pub/Sub subscription that writes the messages to a table according to the topic's Pub/Sub schema.
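For reference, here is a simplified sketch of the publisher side (the project and topic names and the payload layout are placeholders; the real script encodes each message according to the topic schema):

```python
import json
import uuid

from google.cloud import pubsub_v1

# Placeholder topic path; the real script uses our own project and topic.
topic_path = "projects/my-project/topics/my-topic"

batch_settings = pubsub_v1.types.BatchSettings(
    max_messages=1_000,
    max_bytes=1_000_000,
    max_latency=1,
)
publisher = pubsub_v1.PublisherClient(batch_settings=batch_settings)

futures = []
for _ in range(1_000):
    # Each message carries its own unique identifier (uuid.uuid4 here);
    # the payload layout is simplified relative to the actual topic schema.
    payload = {"id": str(uuid.uuid4())}
    futures.append(
        publisher.publish(topic_path, json.dumps(payload).encode("utf-8"))
    )

# Wait for every publish to complete so any errors surface here.
for future in futures:
    future.result()
```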
When I start pushing messages at a rate of 800,000 per 20 s (40,000/s), I observe duplicate rows in the BQ table. Even at a rate of 10,000/s, duplication is observed.
The BQ Pub/Sub subscription has the following settings:
Use topic schema: Enabled
Write metadata: Disabled
Drop unknown fields: Disabled
Subscription expiration: Subscription will never expire
Acknowledgement deadline: 600 seconds
Subscription message retention duration: 7 days
Retain acknowledged messages: No
Exactly once delivery: Disabled (cannot be enabled)
Message ordering: Disabled
Dead lettering: Disabled
Retry policy: Retry after exponential backoff delay
Minimum backoff duration: 60 seconds
Maximum backoff duration: 600 seconds
Is this to be expected? What can I do in order to avoid ending up with duplicate rows?
Thank you for your time.
Some level of duplication is expected as Pub/Sub has at-least-once delivery semantics, including for BigQuery subscriptions. Typically, we expect the duplicate rate to be around 0.1%.
The first question with regard to duplicates is: are they generated on the publish side or on the subscribe side (when writing from Pub/Sub to BigQuery)? To answer this question, enable the `write_metadata` property on your subscription. Then, look at the uniqueness of the `message_id`s relative to the uniqueness of the identifier you are putting in the messages. If you have multiple messages with different `message_id`s but the same identifier you put in the message, then the duplicates are coming from the publish side. If you have messages with the same `message_id`, then the duplicates are coming from the subscribe side and fit within the expectation of the at-least-once delivery guarantees provided by Pub/Sub.
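For example, once `write_metadata` is enabled you could compare the counts with a query along these lines (the table path and the payload identifier column `id` are placeholders for whatever your schema actually uses):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table; "id" stands in for the identifier column carried in the
# message payload, while message_id is populated by write_metadata.
query = """
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT message_id) AS distinct_message_ids,
  COUNT(DISTINCT id) AS distinct_payload_ids
FROM `my-project.my_dataset.my_table`
"""

row = next(iter(client.query(query).result()))

# distinct_payload_ids < distinct_message_ids -> duplicates were published
#   (different Pub/Sub messages carrying the same payload identifier).
# total_rows > distinct_message_ids -> duplicates came from the
#   Pub/Sub-to-BigQuery write (the same message written more than once).
print(row.total_rows, row.distinct_message_ids, row.distinct_payload_ids)
```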
If the duplicates are generated on the publish side, then it's possible your publish load is overloading the client in some way, resulting in a lot of deadline-exceeded errors, which in turn cause publishes to be retried. You could consider taking advantage of publisher flow control. By setting `PublishFlowControl.message_limit`, you can restrict the number of messages awaiting a publish response at a time, which limits the resources needed for publishing and makes it less likely that you will run up against resource constraints that could lead to increased rates of errors and duplicates.
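A minimal sketch of configuring publisher flow control in the Python client (the limits here are illustrative and should be tuned to your workload):

```python
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.types import (
    LimitExceededBehavior,
    PublisherOptions,
    PublishFlowControl,
)

# Cap the number and size of outstanding publishes; with BLOCK, calls to
# publish() wait once either limit is reached instead of piling up requests.
flow_control = PublishFlowControl(
    message_limit=1_000,
    byte_limit=10 * 1024 * 1024,
    limit_exceeded_behavior=LimitExceededBehavior.BLOCK,
)

publisher = pubsub_v1.PublisherClient(
    publisher_options=PublisherOptions(flow_control=flow_control),
)
```

With `BLOCK`, publish calls simply wait when the limit is hit, which naturally throttles your script rather than letting timeouts, retries, and the resulting duplicates accumulate.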