Tags: google-bigquery, google-cloud-pubsub

Duplicate rows in BigQuery using Pub/Sub subscription


I'm writing a Python script that generates N distinct messages and publishes them to a topic on Google Cloud Pub/Sub using BatchSettings(max_messages=1_000, max_bytes=1_000_000, max_latency=1). Each message carries a unique identifier, either a uuid.uuid4 or an integer from an increasing sequence.
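
For reference, the publishing side looks roughly like this; the project/topic names and the payload shape are placeholders, and the loop assumes a JSON-encoded topic schema with `id` and `seq` fields:

```python
import json
import uuid

from google.cloud import pubsub_v1

# Same batch settings as described above.
batch_settings = pubsub_v1.types.BatchSettings(
    max_messages=1_000,
    max_bytes=1_000_000,
    max_latency=1,
)
publisher = pubsub_v1.PublisherClient(batch_settings)
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholder names

publish_futures = []
for seq in range(100_000):
    # Each message carries a unique identifier (uuid4 here, or the sequence number).
    payload = json.dumps({"id": str(uuid.uuid4()), "seq": seq}).encode("utf-8")
    publish_futures.append(publisher.publish(topic_path, payload))

# Block until every publish has completed so failures surface as exceptions.
for future in publish_futures:
    future.result()
```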

That topic has a BigQuery Pub/Sub subscription that writes messages to a table according to the topic's Pub/Sub schema.

When I start pushing messages at a rate of 800,000 per 20 s (40,000/s), I observe duplicate rows in the BQ table. Even at a rate of 10,000/s, duplication occurs.

The BQ Pub/Sub subscription has the following settings:

Use topic schema: Enabled
Write metadata: Disabled
Drop unknown fields: Disabled
Subscription expiration: Subscription will never expire
Acknowledgement deadline: 600 seconds
Subscription message retention duration: 7 days
Retain acknowledged messages: No
Exactly once delivery: Disabled (cannot be enabled)
Message ordering: Disabled
Dead lettering: Disabled
Retry policy: Retry after exponential backoff delay
Minimum backoff duration: 60 seconds
Maximum backoff duration: 600 seconds
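
For completeness, a subscription with these settings can also be created programmatically; a minimal sketch using the Python client, with hypothetical project, topic, subscription, and table names:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
topic_path = "projects/my-project/topics/my-topic"  # placeholder
subscription_path = subscriber.subscription_path("my-project", "my-bq-subscription")

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            # Write directly to BigQuery using the topic schema.
            "bigquery_config": pubsub_v1.types.BigQueryConfig(
                table="my-project.my_dataset.my_table",  # placeholder
                use_topic_schema=True,
                write_metadata=False,
                drop_unknown_fields=False,
            ),
            "ack_deadline_seconds": 600,
            "message_retention_duration": {"seconds": 7 * 24 * 3600},
            "retry_policy": {
                "minimum_backoff": {"seconds": 60},
                "maximum_backoff": {"seconds": 600},
            },
        }
    )
```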

Is this to be expected? What can I do to avoid ending up with duplicate rows?

Thank you for your time.


Solution

  • Some level of duplication is expected as Pub/Sub has at-least-once delivery semantics, including for BigQuery subscriptions. Typically, we expect the duplicate rate to be around 0.1%.

    The first question with regard to duplicates is: are they generated on the publish side or on the subscribe side (when writing from Pub/Sub to BigQuery)? To answer this, enable the write_metadata property on your subscription. Then compare the uniqueness of the message_ids to the uniqueness of the identifier you put in the messages. If you have multiple messages with different message_ids but the same identifier, the duplicates are being generated on the publish side. If you have messages with the same message_id, the duplicates are being generated on the subscribe side and fit within the at-least-once delivery guarantees provided by Pub/Sub.
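
    As a concrete way to run that check once write_metadata is enabled, a query along these lines distinguishes the two cases; the table name and the `my_id` column are placeholders for your own table and identifier field:

    ```python
    from google.cloud import bigquery

    client = bigquery.Client()

    # "my_id" stands for whatever unique identifier you put into each message.
    query = """
        SELECT
          my_id,
          COUNT(*) AS row_count,
          COUNT(DISTINCT message_id) AS distinct_message_ids
        FROM `my-project.my_dataset.my_table`
        GROUP BY my_id
        HAVING COUNT(*) > 1
    """

    for row in client.query(query).result():
        # distinct_message_ids > 1: duplicates were created on the publish side.
        # distinct_message_ids = 1: the same message was written more than once
        # (subscribe-side duplication, within at-least-once expectations).
        print(row.my_id, row.row_count, row.distinct_message_ids)
    ```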

    If the duplicates are generated on the publish side, then it's possible your publish load is overloading the client in some way, resulting in a lot of deadline-exceeded errors, which cause the messages to be republished. You could consider taking advantage of publisher flow control. By setting PublishFlowControl.message_limit, you restrict the number of messages awaiting a publish response at any time, which limits the resources needed for publishing and makes it less likely that you will run up against resource constraints that lead to increased rates of errors and duplicates.
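
    A minimal sketch of enabling publisher flow control in the Python client; the specific limits are illustrative and should be tuned to your workload:

    ```python
    from google.cloud import pubsub_v1
    from google.cloud.pubsub_v1.types import (
        LimitExceededBehavior,
        PublisherOptions,
        PublishFlowControl,
    )

    # Block further publish() calls once 1,000 messages (or ~10 MB) are awaiting
    # a response, rather than letting outstanding publishes pile up and time out.
    publisher = pubsub_v1.PublisherClient(
        publisher_options=PublisherOptions(
            flow_control=PublishFlowControl(
                message_limit=1_000,          # illustrative limit
                byte_limit=10 * 1024 * 1024,  # illustrative limit
                limit_exceeded_behavior=LimitExceededBehavior.BLOCK,
            )
        )
    )
    ```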